parties.



First the conditions must be verified. Because this is a random sample from less than 
10% of the population, the observations are independent, both within the samples 
and between the samples. The success-failure condition also holds using the sample 
proportions (for each sample) 17 . Because our conditions are met, the normal model
can be used for the point estimate of the difference in support: 

$$\hat{p}_D - \hat{p}_R = 0.82 - 0.20 = 0.62$$

The standard error may be computed using Equation (5.23) with each sample proportion:

$$SE \approx \sqrt{\frac{0.82(1 - 0.82)}{325} + \frac{0.20(1 - 0.20)}{172}} = 0.037$$

For a 90% confidence interval, we use $z^{\star} = 1.65$:

$$\text{point estimate} \pm z^{\star} SE \quad\to\quad 0.62 \pm 1.65 \times 0.037 \quad\to\quad (0.56, 0.68)$$

We are 90% confident that the difference in support for healthcare action between 
the two parties is between 56% and 68%. Healthcare is a very partisan issue, which 
may not be a surprise to anyone who follows the ongoing health care debates. 
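
The interval above can be checked numerically. The following Python sketch assumes the sample sizes 325 and 172 that appear in the standard error computation; the function name `diff_prop_ci` is illustrative and not from the text. It uses the unrounded normal quantile (about 1.645) instead of 1.65, which does not change the interval at two decimal places.

```python
from math import sqrt
from statistics import NormalDist


def diff_prop_ci(p1, n1, p2, n2, conf_level=0.90):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    point = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Two-sided interval: put half of the leftover probability in each tail.
    z_star = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return point - z_star * se, point + z_star * se, se


lower, upper, se = diff_prop_ci(0.82, 325, 0.20, 172)
print(f"SE = {se:.3f}, 90% CI = ({lower:.2f}, {upper:.2f})")
```

The same function applies to other confidence levels by changing `conf_level`, provided the independence and success-failure conditions have been checked first.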



16 http://www.gallup.com/poll/125030/Healthcare-Bill-Support-Ticks-Up-Public-Divided.aspx
17 Sometimes for the two proportion case, the success-failure threshold is lowered to 5. In this book, we
will still use 10.
Exercise 5.25 A remote control car company is considering a new manufacturer for
wheel gears. The new manufacturer would be more expensive, but their higher quality
gears are more reliable, resulting in happier customers and fewer warranty claims.
However, management must be convinced that the more expensive gears are worth
the conversion before they approve the switch. If there is strong evidence of more
than a 3% improvement in the percent of gears that pass inspection, management
says they will switch suppliers; otherwise they will maintain the current supplier.
Set up appropriate hypotheses for the test. Answer in the footnote 18 .

Example 5.26 The quality control engineer from Exercise 5.25 collects a sample
of gears, examining 1000 gears from each company and finds that 899 gears pass 
inspection from the current supplier and 958 pass inspection from the prospective 
supplier. Using these data, evaluate the hypothesis setup of Exercise 5.25 using a 
significance level of 5%. 

First, we check the conditions. The sample is not necessarily random, so to proceed
we must assume the gears are all independent; for this sample we will suppose
this assumption is reasonable, but the engineer would be more knowledgeable as to
whether this assumption is appropriate. The success-failure condition also holds for
each sample. Thus, the difference in sample proportions, 0.958 - 0.899 = 0.059, can
be said to come from a nearly normal distribution.

The standard error can be found using Equation (5.23):

$$SE = \sqrt{\frac{0.958(1 - 0.958)}{1000} + \frac{0.899(1 - 0.899)}{1000}} = 0.0114$$

In this hypothesis test, the sample proportions were used. We will discuss this choice 
more in Section 5.4.3. 

Next, we compute the test statistic and use it to find the p-value, which is depicted 
in Figure 5.12. 



Using the normal model for this test statistic, we identify the right tail area as 0.006.
Since this is a one-sided test, this single tail area is also the p-value, and we reject
the null hypothesis because 0.006 is less than 0.05. That is, we have statistically
significant evidence that the higher quality gears pass inspection more than 3%
more often than the currently used gears. Based on these results, management
will approve the switch to the new supplier.
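
The test in Example 5.26 can be sketched in a few lines of Python. The variable names are illustrative. Because the code carries the unrounded standard error through the calculation, its test statistic comes out slightly below the 2.54 obtained from the rounded SE of 0.0114; the tail area still rounds to 0.006.

```python
from math import sqrt
from statistics import NormalDist

# Counts from Example 5.26: gears passing inspection out of 1000 per supplier.
p_new, n_new = 958 / 1000, 1000   # prospective supplier
p_cur, n_cur = 899 / 1000, 1000   # current supplier
null_value = 0.03                 # H0: p_new - p_cur = 0.03

point_estimate = p_new - p_cur
# Unpooled SE, since the null difference is 0.03 rather than 0.
se = sqrt(p_new * (1 - p_new) / n_new + p_cur * (1 - p_cur) / n_cur)
z = (point_estimate - null_value) / se
# One-sided test (HA: difference > 0.03), so the p-value is the upper tail.
p_value = 1 - NormalDist().cdf(z)

print(f"SE = {se:.4f}, z = {z:.2f}, p-value = {p_value:.3f}")
```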



5.4.3 Hypothesis testing when $H_0: p_1 = p_2$

Here we use a new example to examine a special estimate of standard error when $H_0: p_1 = p_2$.
We investigate whether there is an increased risk of cancer in dogs that are exposed
to the herbicide 2,4-dichlorophenoxyacetic acid (2,4-D). A study in 1991 examined 491

18 $H_0$: The higher quality gears will pass inspection no more than 3% more frequently than the standard
quality gears. $p_{highQ} - p_{standard} = 0.03$. $H_A$: The higher quality gears will pass inspection more than 3%
more often than the standard quality gears. $p_{highQ} - p_{standard} > 0.03$.




Test statistic for Example 5.26:

$$z = \frac{\text{point estimate} - \text{null value}}{SE} = \frac{0.059 - 0.03}{0.0114} = 2.54$$



5.4. DIFFERENCE OF TWO PROPORTIONS 
[Figure: normal curve centered at the null value 0.03, with the tail above 0.059 shaded.]

Figure 5.12: Distribution of the test statistic if the null hypothesis was true.
The p-value is represented by the shaded area.

dogs that had developed cancer and 945 dogs as a control group 19 . Of these two groups, 
researchers identified which dogs had been exposed to 2,4-D in their owner's yard. The 
results are shown in Table 5.13. 





              cancer   noCancer
2,4-D            191        304
no 2,4-D         300        641

Table 5.13: Summary results for cancer in dogs and the use of 2,4-D by the
dog's owner.

Exercise 5.27 Is this study an experiment or an observational study? Answer in
the footnote 20 .

Exercise 5.28 Set up hypotheses to test whether 2,4-D and the occurrence of
cancer in dogs are related. Use a one-sided test and compare across the cancer and
no cancer groups. Comment and answer in the footnote 21 .

Example 5.29 Are the conditions met to use the normal model and make inference
on the results?



(1) It is unclear whether this is a random sample. However, if we believe the dogs in 
both the cancer and no cancer groups are representative of each respective population 

19 Hayes HM, Tarone RE, Cantor KP, Jessen CR, McCurnin DM, and Richardson RC. 1991.
Case-Control Study of Canine Malignant Lymphoma: Positive Association With Dog Owner's Use of
2,4-Dichlorophenoxyacetic Acid Herbicides. Journal of the National Cancer Institute 83(17):1226-1231.

20 The owners were not instructed to apply or not apply the herbicide, so this is an observational study. 
This question was especially tricky because one group was called the control group, which is a term usually 
seen in experiments. 

21 Using the proportions within the cancer and noCancer groups rather than examining the rates for 
cancer in the 2,4-D and no 2,4-D groups may seem odd; we might prefer to condition on the use of the 
herbicide, which is an explanatory variable in this case. However, the cancer rates in each group do not 
necessarily reflect the cancer rates in reality due to the way the data were collected. For this reason, 
computing cancer rates may greatly alarm dog owners. 

$H_0$: the proportion of dogs with exposure to 2,4-D is the same in the cancer and noCancer groups
($p_c - p_n = 0$).

$H_A$: the dogs with cancer are more likely to have been exposed to 2,4-D than dogs without cancer
($p_c - p_n > 0$).
and that the dogs in the study do not interact in any way, then we may find it
reasonable to assume independence between observations holds. (2) The success- 
failure condition holds for each sample. 

Under the assumption of independence, we can use the normal model and make 
statements regarding the canine population based on the data. 



In your hypotheses for Exercise 5.28, the null is that the proportion of dogs with
exposure to 2,4-D is the same in each group. The point estimate of the difference in sample
proportions is $\hat{p}_c - \hat{p}_n = 0.067$. To identify the p-value for this test, we first check conditions
(Example 5.29) and compute the standard error of the difference:

$$SE = \sqrt{\frac{p_c(1 - p_c)}{n_c} + \frac{p_n(1 - p_n)}{n_n}}$$

In a hypothesis test, the distribution of the test statistic is always examined as though the
null hypothesis was true, i.e. in this case, $p_c = p_n$. The standard error formula should
reflect this equality in the null hypothesis. We will use $p$ to represent the common rate of
dogs that are exposed to 2,4-D in the two groups:

$$SE = \sqrt{\frac{p(1 - p)}{n_c} + \frac{p(1 - p)}{n_n}}$$

We don't know the exposure rate, $p$, but we can obtain a good estimate of it by pooling
the results of both samples:

$$\hat{p} = \frac{\text{\# of "successes"}}{\text{\# of cases}} = \frac{191 + 304}{191 + 300 + 304 + 641} = 0.345$$

This is called the pooled estimate of the sample proportion, and we use it to compute
the standard error when the null hypothesis is that $p_1 = p_2$ (e.g. $p_c = p_n$ or $p_c - p_n = 0$).
We also typically use it to verify the success-failure condition.



Pooled estimate of a proportion

When the null hypothesis is $p_1 = p_2$, it is useful to find the pooled estimate of the
shared proportion:

$$\hat{p} = \frac{\text{number of "successes"}}{\text{number of cases}} = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2}$$

Here $\hat{p}_1 n_1$ represents the number of successes in sample 1 since

$$\hat{p}_1 = \frac{\text{number of successes in sample 1}}{n_1}$$

Similarly, $\hat{p}_2 n_2$ represents the number of successes in sample 2.



TIP: Using the pooled estimate of a proportion when $H_0: p_1 = p_2$

When the null hypothesis suggests the proportions are equal, we use the pooled
proportion estimate ($\hat{p}$) to verify the success-failure condition and also to estimate
the standard error:

$$SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n_1} + \frac{\hat{p}(1 - \hat{p})}{n_2}} \qquad (5.30)$$



Exercise 5.31 Using Equation (5.30), $\hat{p} = 0.345$, $n_1 = 491$, and $n_2 = 945$, verify
the estimate for the standard error is SE = 0.026. Next, complete the hypothesis test
using a significance level of 0.05. Be certain to draw a picture, compute the p-value,
and state your conclusion in both statistical language and plain language. A short
answer is provided in the footnote 22 .
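
As a check on the pooled calculation, the sketch below works directly from the counts in Table 5.13. The variable names are illustrative. It keeps full precision throughout, so the test statistic and p-value differ slightly from values obtained with the rounded inputs 0.067 and 0.026, while leading to the same decision at the 0.05 level.

```python
from math import sqrt
from statistics import NormalDist

# Dogs exposed to 2,4-D ("successes") and group sizes from Table 5.13.
exposed_cancer, n_cancer = 191, 191 + 300       # cancer group, n = 491
exposed_control, n_control = 304, 304 + 641     # noCancer group, n = 945

p_c = exposed_cancer / n_cancer
p_n = exposed_control / n_control

# Pooled estimate of the shared proportion under H0: p_c = p_n.
p_pool = (exposed_cancer + exposed_control) / (n_cancer + n_control)

# Equation (5.30): the pooled proportion replaces both sample proportions.
se = sqrt(p_pool * (1 - p_pool) / n_cancer + p_pool * (1 - p_pool) / n_control)

z = (p_c - p_n - 0) / se
p_value = 1 - NormalDist().cdf(z)   # one-sided: HA is p_c - p_n > 0

print(f"pooled p = {p_pool:.3f}, SE = {se:.3f}, z = {z:.2f}, p-value = {p_value:.4f}")
```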



5.5 When to retreat 

The conditions described for each statistical method ensure each test statistic follows the 
prescribed distribution when the null hypothesis is true. When the conditions are not met, 
these methods are not reliable and drawing conclusions from them is treacherous. The 
conditions for each test typically come in two forms. 

• The individual observations must be independent. A random sample from less than 
10% of the population ensures the observations are independent (not to be confused 
with a minimum of 10 successes and 10 failures). In experiments, other considerations 
must be made to ensure observations are independent. If independence fails, then 
advanced techniques must be used, and in some such cases, inference regarding the 
target population may not be possible. 

• Other conditions focus on sample size and skew. For example, if the sample size is 
too small or the skew too strong, then the normal model for the sample mean will 
fail and the methods described in this chapter are not reliable. 

For analyzing smaller samples, we refer to Chapter 6. 

Verification of conditions for statistical tools is always necessary. Whenever conditions 
are not satisfied for a statistical technique, there are three options. The first is to learn 
new methods that are appropriate for the data. The second route is to hire a statistician 23 . 
The third route is to ignore the failure of conditions. This last option effectively invalidates 
any analysis and may discredit novel and interesting findings. 

Finally, we caution that there may be no inference tools helpful when considering data 
that include unknown biases, such as convenience samples. For this reason, there are books, 
courses, and researchers devoted to the techniques of sample and experimental design. See 
Sections 1.5-1.7 for basic principles of sampling and data collection. 

22 Compute the test statistic:

$$Z = \frac{\text{point estimate} - \text{null value}}{SE} = \frac{0.067 - 0}{0.026} = 2.58$$

Looking this value up in the normal probability table: 0.9951. However, this is the lower tail, and the upper
tail represents the p-value: 1 - 0.9951 = 0.0049. We reject the null hypothesis and conclude that dogs
getting cancer and owners using 2,4-D are associated.

23 If you work at a university, then there may be campus consulting services to assist you. Alternatively, 
there are many private consulting firms that are also available for hire. 






CHAPTER 5. LARGE SAMPLE INFERENCE 



5.6 Testing for goodness of fit using chi-square (special 
topic) 

In this section, we develop a method for assessing a null model when the data are binned. 
This technique is commonly used in two circumstances: 

• Given a sample of cases that can be classified into several groups, we would like to 
determine if the sample is representative of the general population. 

• Evaluate whether data resemble a particular distribution, such as the normal distri- 
bution or a geometric distribution. 

Each of these scenarios can be addressed using the same statistical test: a chi-square test. 

In the first case, we consider data from a random sample of 275 jurors in a small 
county. Jurors identified their racial group, as shown in Table 5.14, and we would like 
to determine if these jurors are racially representative of the population. If the jury is 
representative of the population, then the proportions in the sample should roughly reflect 
the population of eligible jurors, i.e. registered voters. 



Race                        White   Black   Hispanic   Other   Total
Representation in juries      205      26         25      19     275
Registered voters            0.72    0.07       0.12    0.09    1.00

Table 5.14: Representation by race in a city's juries and population.



While the proportions in the juries do not precisely represent the population propor- 
tions, it is unclear whether these data provide convincing evidence that the sample is not 
representative. If the jurors really were randomly sampled from the registered voters, we 
might expect small differences due to chance. However, unusually large differences may 
provide convincing evidence that the juries were not representative. 

A second application, assessing the fit of a distribution, is presented at the end of this 
section. Daily stock returns from the S&P500 for 1990-2009 are used to assess whether 
stock activity each day is independent of the stock's behavior on previous days. 

In these problems, the strategy is not to assess or compare only one or two bins at a 
time. We would like to examine all bins simultaneously, which will require us to develop a 
new test statistic. While the details of the test will be new, the general ideas of hypothesis 
testing will be familiar. 

5.6.1 Creating a test statistic for one-way tables 

Example 5.32 Of the folks in the city, 275 served on a jury. If the individuals are
randomly selected to serve on a jury, about how many of the 275 people would we 
expect to be white? How many would we expect to be black? 

About 72% of the population is white, so we would expect about 72% of the jurors 
to be white: 0.72 * 275 = 198. 

Similarly, we would expect about 7% of the jurors to be black, which would correspond 
to about 0.07 * 275 = 19.25 black jurors. 

Exercise 5.33 Twelve percent of the population is Hispanic and 9% represent other
races. How many of the 275 jurors would we expect to be Hispanic or from another 
race? Answers can be found in Table 5.15. 
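
The expected counts are simply the null proportions scaled by the number of jurors. A minimal Python sketch, using the registered-voter proportions from Table 5.14 (the dictionary layout is an illustrative choice):

```python
n_jurors = 275

# Proportion of registered voters in each group (Table 5.14).
null_proportions = {"White": 0.72, "Black": 0.07, "Hispanic": 0.12, "Other": 0.09}

# Expected count = null proportion * total number of jurors.
expected = {race: p * n_jurors for race, p in null_proportions.items()}

for race, count in expected.items():
    print(f"{race}: {count:.2f}")
```

Expected counts need not be whole numbers; they describe a long-run average, not a possible jury.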
Race               White   Black   Hispanic   Other   Total
Observed data        205      26         25      19     275
Expected counts      198   19.25         33   24.75     275

Table 5.15: Actual and expected make-up of the jurors.



The sample proportion represented from each race among the 275 jurors was not a 
precise match for any ethnic group. While some sampling variation is expected, we would 
expect the sample proportions to be fairly similar to the population proportions if there 
is no bias on juries. We need to test whether the differences are strong enough to provide 
convincing evidence that the jurors are not a random sample. These ideas can be organized 
into hypotheses: 

$H_0$: The jurors are a random sample. That is, there is no racial bias in who serves on a
jury, and the observed counts reflect natural sampling fluctuation.

$H_A$: The jurors are not randomly sampled, i.e. there is racial bias in juror selection.

To assess these hypotheses, we quantify how different the observed counts are from the 
expected counts. Strong evidence for the alternative hypothesis would come in the form of 
unusually large deviations in the groups from what would be expected based on sampling 
variation alone. 



5.6.2 The chi-square test statistic 

In previous hypothesis tests, we constructed a test statistic of the following form:

$$\frac{\text{point estimate} - \text{null value}}{\text{SE of point estimate}}$$

This construction was based on (1) identifying the difference between a point estimate 
and an expected value if the null hypothesis was true, and (2) standardizing that difference 
using the standard error of the point estimate. These two ideas will help in the construction 
of an appropriate test statistic for count data. 

Our strategy will be to first compute the difference between the observed counts and 
the counts we would expect if the null hypothesis was true, then we will standardize the 
difference: 

$$\frac{\text{observed white count} - \text{null white count}}{\text{SE of observed white count}}$$

The standard error for the point estimate of the count in binned data is the square root of
the count under the null 24 . Therefore:

$$Z_1 = \frac{205 - 198}{\sqrt{198}} = 0.50$$



24 Using some of the rules learned in earlier chapters, we might think that the standard error would be
$\sqrt{np(1 - p)}$, where $n$ is the sample size and $p$ is the proportion in the population. This would be correct if
we were looking only at one count. However, we are computing many standardized differences and adding
them together. It can be shown - though not here - that the square root of the count is a better way to
standardize the count differences.
The fraction is very similar to previous test statistics: first compute a difference, then
standardize it. These computations should also be completed for the black, Hispanic, and
other groups:

Black:    $Z_2 = \frac{26 - 19.25}{\sqrt{19.25}} = 1.54$
Hispanic: $Z_3 = \frac{25 - 33}{\sqrt{33}} = -1.39$
Other:    $Z_4 = \frac{19 - 24.75}{\sqrt{24.75}} = -1.16$



We would like to determine if these four standardized differences are irregularly far from
zero using a single test statistic. That is, $Z_1$, $Z_2$, $Z_3$, and $Z_4$ must be combined somehow
to help determine if they - as a group - tend to be unusually far from zero. A first thought
might be to take the absolute value of these four standardized differences and add them
up:

$$|Z_1| + |Z_2| + |Z_3| + |Z_4| = 4.58$$

Indeed, this does give one number summarizing how far the actual counts are from what
was expected. However, it is more common to add the squared values:

$$Z_1^2 + Z_2^2 + Z_3^2 + Z_4^2 = 5.89$$

Squaring each standardized difference before adding them together does two things: 

• Any standardized difference that is squared will now be positive. 

• Differences that already looked unusual - e.g. a standardized difference of 2.5 - would 
become much larger after being squared. 

We commonly use the test statistic $X^2$ (the chi-square test statistic), which is the sum of the $Z^2$ values:

$$X^2 = Z_1^2 + Z_2^2 + Z_3^2 + Z_4^2 = 5.89$$

This expression can also be written using the observed counts and null counts:

$$X^2 = \frac{(\text{observed count}_1 - \text{null count}_1)^2}{\text{null count}_1} + \cdots + \frac{(\text{observed count}_4 - \text{null count}_4)^2}{\text{null count}_4}$$

The final number $X^2$ summarizes how strongly the observed counts tend to deviate from
the null counts. In Section 5.6.4, we will see that if the null hypothesis is true, then $X^2$
follows a new distribution called a chi-square distribution. Using this distribution, we will
be able to obtain a p-value to evaluate the hypotheses.
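
The arithmetic behind the standardized differences and their sum of squares can be reproduced with a short Python sketch. The list names are illustrative; the counts are those of the juror example.

```python
from math import sqrt

observed = [205, 26, 25, 19]          # White, Black, Hispanic, Other
expected = [198, 19.25, 33, 24.75]    # counts under the null hypothesis

# Standardize each difference by the square root of the null count.
z_values = [(o - e) / sqrt(e) for o, e in zip(observed, expected)]

# The chi-square test statistic is the sum of the squared standardized differences.
x2 = sum(z ** 2 for z in z_values)

print([round(z, 2) for z in z_values])
print(round(x2, 2))
```

Summing `(o - e) ** 2 / e` directly gives the same value, which is the form written with observed and null counts.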

5.6.3 The chi-square distribution and finding areas 

The chi-square distribution is sometimes used to characterize data sets and statistics 
that are always positive and typically right skewed. Recall the normal distribution had 
two parameters - mean and standard deviation - that could be used to describe its exact 
characteristics. The chi-square distribution has just one parameter called degrees of 
freedom (df), which influences the shape, center, and spread of the distribution. 

Exercise 5.34 Figure 5.16 shows three chi-square distributions. (a) How does the
center of the distribution change when the degrees of freedom is larger? (b) What
about the variability (spread)? (c) How does the shape change?
[Figure: chi-square density curves for 2, 4, and 9 degrees of freedom; horizontal axis from 0 to 25.]

Figure 5.16: Three chi-square distributions with varying degrees of freedom.

Figure 5.16 and Exercise 5.34 demonstrate three general properties of chi-square dis- 
tributions as the degrees of freedom increases: the distribution becomes more symmetric, 
the center moves to the right, and the variability inflates. 

Our principal interest in the chi-square distribution is the calculation of p-values, which
(as we have seen before) is related to finding the relevant area in the tail of a distribution.
To do so, a new table is needed: the chi-square probability table, partially shown in 
Table 5.17. A more complete table is presented in Appendix C.3 on page 366. This table 
differs a bit from the normal probability table, in that we typically do not find the tail 
area very precisely. Instead, we identify a range for the area. Additionally, the chi-square 
probability table only provides upper tail values as opposed to the normal probability table, 
which shows lower tail areas. 



Upper tail    0.3     0.2     0.1     0.05    0.02    0.01    0.005   0.001
df   1        1.07    1.64    2.71    3.84    5.41    6.63    7.88    10.83
     2        2.41    3.22    4.61    5.99    7.82    9.21    10.60   13.82
     3        3.66    4.64    6.25    7.81    9.84    11.34   12.84   16.27
     4        4.88    5.99    7.78    9.49    11.67   13.28   14.86   18.47
     5        6.06    7.29    9.24    11.07   13.39   15.09   16.75   20.52
     6        7.23    8.56    10.64   12.59   15.03   16.81   18.55   22.46
     7        8.38    9.80    12.02   14.07   16.62   18.48   20.28   24.32

Table 5.17: A section of the chi-square probability table. A complete table
is in Appendix C.3 on page 366.



Example 5.35 Figure 5.18(a) shows a chi-square distribution with 3 degrees of
freedom and an upper shaded tail starting at 6.25. Use Table 5.17 to estimate the 
shaded area. 

This distribution has three degrees of freedom, so only the row with 3 degrees of 
freedom (df) is relevant. This row has been italicized in the table. Next, we see that 
the value - 6.25 - falls in the column with upper tail area 0.1. That is, the shaded 
upper tail of Figure 5.18(a) has area 0.1. 
Example 5.36 We rarely observe the exact value in the table. For instance, Figure 5.18(b)
shows the upper tail of a chi-square distribution with 2 degrees of freedom.
The bound for this upper tail is at 4.3, which does not fall in Table 5.17. 

The cutoff of interest - 4.3 - falls between the second and third columns in the 2 
degrees of freedom row. Because these columns correspond to tail areas of 0.2 and 
0.1, we can be certain that the area shaded in Figure 5.18(b) is between 0.1 and 0.2. 

Example 5.37 Figure 5.18(c) shows an upper tail area for a chi-square distribution
with 5 degrees of freedom and a cutoff of 5.1. Find the tail area. 

Looking in the row with 5 df, 5.1 falls below the smallest cutoff for this row (6.06). 
That means we can only say that the area is greater than 0.3. 
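
When software is available, the tail area can be computed directly instead of bracketed from the table. The sketch below is one way to do this with only the Python standard library; the function name `chi2_upper_tail` is illustrative. It relies on the closed-form tail areas for 1 and 2 degrees of freedom and on the standard recursion that adds one term each time the degrees of freedom increase by two. It is a sketch for whole-number degrees of freedom, not a replacement for a statistics library.

```python
from math import erfc, exp, gamma, sqrt


def chi2_upper_tail(x, df):
    """Upper tail area of a chi-square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        tail, k = exp(-x / 2), 2           # exact tail for 2 degrees of freedom
    else:
        tail, k = erfc(sqrt(x / 2)), 1     # exact tail for 1 degree of freedom
    # Recursion: tail(df = k + 2) = tail(df = k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
    while k < df:
        tail += (x / 2) ** (k / 2) * exp(-x / 2) / gamma(k / 2 + 1)
        k += 2
    return tail


# The three cutoffs from Examples 5.35-5.37.
for cutoff, df in [(6.25, 3), (4.3, 2), (5.1, 5)]:
    print(f"df = {df}, cutoff = {cutoff}: upper tail = {chi2_upper_tail(cutoff, df):.3f}")
```

The computed areas should agree with the table-based answers: about 0.1 for the first, between 0.1 and 0.2 for the second, and greater than 0.3 for the third.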

Exercise 5.38 Figure 5.18(d) shows a cutoff of 11.7 on a chi-square distribution
with 7 degrees of freedom. Find the area of the upper tail. Answer in the footnote 25 .

Exercise 5.39 Figure 5.18(e) shows a cutoff of 10 on a chi-square distribution with
4 degrees of freedom. Find the area of the upper tail. Short answer in the footnote 26 .

Exercise 5.40 Figure 5.18(f) shows a cutoff of 9.21 with a chi-square distribution
with 3 df. Find the area of the upper tail. Short answer in the footnote 27 .



5.6.4 Finding a p-value for a chi-square test 

In Section 5.6.2, we identified a new test statistic ($X^2$) within the context of assessing
whether there was evidence of racial bias in how jurors were sampled. The null hypothesis
represented the claim that jurors were randomly sampled and there was no racial bias. The
alternative hypothesis was that there was racial bias in how the jurors were sampled.

We determined that a large $X^2$ value would suggest strong evidence favoring the
alternative hypothesis: that there was racial bias. However, we could not quantify what the
chance was of observing such a large test statistic ($X^2 = 5.89$) if the null hypothesis actually
was true. This is where the chi-square distribution becomes useful. If the null hypothesis
was true and there was no racial bias, then $X^2$ would follow a chi-square distribution, in
this case with three degrees of freedom. In general, the statistic $X^2$ follows a chi-square
distribution with $k - 1$ degrees of freedom, where $k$ is the number of bins.

Example 5.41 How many categories were there in the juror example? How many
degrees of freedom should be associated with the chi-square distribution used for $X^2$?

In the jurors example, there were $k = 4$ categories: white, black, Hispanic, and other.
According to the rule above, the test statistic $X^2$ should then follow a chi-square
distribution with $k - 1 = 3$ degrees of freedom if $H_0$ is true.

Just like we checked sample size conditions to use the normal model in earlier sections,
we must also check a sample size condition to safely apply the chi-square distribution for
$X^2$. Each expected count must be at least 10 (see footnote 28). In the juror example, the expected counts
were 198, 19.25, 33, and 24.75, all easily above 10, so we can apply the chi-square model
to the test statistic, $X^2 = 5.89$.

25 The value 11.7 falls between 9.80 and 12.02 in the 7 df row. Thus, the area is between 0.1 and 0.2.

26 The area is between 0.02 and 0.05.

27 Between 0.02 and 0.05.

28 Some books recommend a threshold of 5.
Figure 5.18: (a) Chi-square distribution with 3 degrees of freedom, area
above 6.25 shaded, (b) 2 degrees of freedom, area above 4.3 shaded, (c) 5 
degrees of freedom, area above 5.1 shaded, (d) 7 degrees of freedom, area 
above 11.7 shaded, (e) 4 degrees of freedom, area above 10 shaded, (f) 3 
degrees of freedom, area above 9.21 shaded. 
[Figure: chi-square density with horizontal axis from 0 to 15.]

Figure 5.19: The p-value for the juror hypothesis test is shaded in the chi-square
distribution with df = 3.

Example 5.42 If the null hypothesis were true, the test statistic $X^2 = 5.89$ would
be closely associated with a chi-square distribution with three degrees of freedom.
Using this distribution and test statistic, identify the p-value.

The chi-square distribution and p-value are shown in Figure 5.19. Because larger chi- 
square values correspond to stronger evidence against the null hypothesis - i.e. larger 
deviations from what we expect - we shade the upper tail to represent the p-value. 
Using the chi-square probability table in Appendix C.3 or the short table on page 215, 
we can determine that the area is between 0.1 and 0.2. That is, the p-value is larger 
than 0.1 but smaller than 0.2. Generally we do not reject the null hypothesis with 
such a large p-value. In other words, the data do not provide convincing evidence of 
racial bias in the juror selection. 



Chi-square test for one-way table

Suppose we are to evaluate whether there is convincing evidence that a set of
observed counts $O_1$, $O_2$, ..., $O_k$ in $k$ categories are unusually different from what
might be expected under a null hypothesis. Call the expected counts that are based
on the null hypothesis $E_1$, $E_2$, ..., $E_k$. If each expected count is at least 10 and
the null hypothesis is true, then the following test statistic follows a chi-square
distribution with $k - 1$ degrees of freedom:

$$X^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \cdots + \frac{(O_k - E_k)^2}{E_k}$$

The p-value for this test statistic is found by looking at the upper tail of this chi-square
distribution. We consider the upper tail because larger values of $X^2$ would
provide greater evidence against the null hypothesis.
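
The whole one-way procedure (expected counts, sample size check, test statistic, degrees of freedom, and p-value) can be sketched as a single Python function. The names `chi_square_one_way` and `chi2_upper_tail` are illustrative, and the tail-area helper is a small standard-library routine for whole-number degrees of freedom that stands in for the probability table. Independence of the cases still has to be judged from how the data were collected; the code cannot check it.

```python
from math import erfc, exp, gamma, sqrt


def chi2_upper_tail(x, df):
    """Upper tail area of a chi-square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    tail, k = (exp(-x / 2), 2) if df % 2 == 0 else (erfc(sqrt(x / 2)), 1)
    while k < df:  # each step raises the degrees of freedom by two
        tail += (x / 2) ** (k / 2) * exp(-x / 2) / gamma(k / 2 + 1)
        k += 2
    return tail


def chi_square_one_way(observed, null_proportions, min_expected=10):
    """Return (X^2, df, p-value) for a one-way table of counts."""
    total = sum(observed)
    expected = [p * total for p in null_proportions]
    if min(expected) < min_expected:
        raise ValueError("an expected count is below the sample size threshold")
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return x2, df, chi2_upper_tail(x2, df)


# Juror example: observed counts and registered-voter proportions.
x2, df, p_value = chi_square_one_way([205, 26, 25, 19], [0.72, 0.07, 0.12, 0.09])
print(f"X^2 = {x2:.2f}, df = {df}, p-value = {p_value:.3f}")
```

For the juror data the computed p-value falls between 0.1 and 0.2, in line with the range read from the table.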



TIP: Conditions for the chi-square test 

There are two conditions that must be checked before performing a chi-square test: 

Independence. Each case that contributes a count to the table must be indepen- 
dent of all the other cases in the table. 

Sample size / distribution. Just like for proportions, each particular scenario 
(i.e. cell count) must have at least 10 cases. 

Failing to check conditions may unintentionally affect the test's error rates. 
5.6.5 Evaluating goodness of fit for a distribution

Section 3.3 would be useful background reading for this example, but it is not a prerequisite. 

We can apply our new chi-square testing framework to the second problem in this 
section: evaluating whether a certain statistical model fits a data set. Daily stock returns 
from the S&P500 for 1990-2009 can be used to assess whether stock activity each day is 
independent of the stock's behavior on previous days. This sounds like a very complex 
question - and it is - but a chi-square test can be used to study the problem. We will label 
each day as Up or Down (D) depending on whether the market was up or down that day. 
For example, consider the following changes in price, their new labels of up and down, and 
then the number of days that must be observed before each Up day: 

Change in price   2.52  -1.46   0.51  -4.07   3.36   1.10  -5.46  -1.03  -2.99   1.71
Outcome             Up      D     Up      D     Up     Up      D      D      D     Up
Days to Up           1      -      2      -      2      1      -      -      -      4

If the days really are independent, then the number of days until a positive trading day 
should follow a geometric distribution. The geometric distribution describes the probability 
of waiting for the $k$th trial to observe the first success. Here each Up day represents a success,
and down (D) days represent failures. In the data above, it took only one day until the 
market was up, so the first wait time was 1 day. It took two more days before we observed 
our next Up trading day, and two more for the third Up day. We would like to determine if 
these counts (1, 2, 2, 1, 4, and so on) follow the geometric distribution. Table 5.20 shows 
the number of waiting days for a positive trading day during 1990-2009 for the S&P500. 
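
The labeling step can be sketched in Python using the ten price changes listed above. The function name `waits_until_up` is illustrative, and treating a change of exactly zero as a Down day is an assumption made here, since the text does not say how such days are handled.

```python
def waits_until_up(changes):
    """Number of trading days waited before each Up day (an Up day counts as day 1)."""
    waits = []
    days_waited = 0
    for change in changes:
        days_waited += 1
        if change > 0:          # Up day: record the wait and restart the count
            waits.append(days_waited)
            days_waited = 0
    return waits


changes = [2.52, -1.46, 0.51, -4.07, 3.36, 1.10, -5.46, -1.03, -2.99, 1.71]
print(waits_until_up(changes))
```

Tallying how often each wait length occurs over the full 1990-2009 record gives the observed counts used in the test.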



Days 1 2 3 4 5 6 7+ Total 

Observed 1298 685 367 157 77 33 20 2587 



Table 5.20: Distribution of the waiting time until a positive trading day. 

We consider how many days one must wait until observing an Up day on the S&P500 
stock exchange. If the stock activity was independent from one day to the next and the 
probability of a positive trading day was constant, then we would expect this waiting time 
to follow a geometric distribution. We can organize this into a hypothesis framework: 

$H_0$: Whether the stock market is up or down on a given day is independent from all other
days. We will consider the number of days that pass until an Up day is observed.
Under this hypothesis, the number of days until an Up day should follow a geometric
distribution.

$H_A$: The days are not independent. Since we know the number of days until an Up day
would follow a geometric distribution under the null, we look for deviations from the
geometric distribution, which would support the alternative hypothesis.

There are important implications in our result for stock traders: if information from past 
trading days is useful in telling what will happen today, that may provide an edge over 
other traders. 

We consider data for the S&P500 from 1990 to 2009 and summarize the waiting times 
in Table 5.21 and Figure 5.22. The S&P500 was positive on 52.9% of those days. 

Because applying the chi-square framework requires expected counts to be at least 
10, we have binned together all the cases where the waiting time was at least 7 days. The 
actual data, shown in the Observed row in Table 5.21, can be compared to the expected 
Days            1     2     3     4     5     6    7+   Total

Observed     1298   685   367   157    77    33    20    2587

Geom. Model  1368   644   304   143    67    32    28    2587

Table 5.21: Distribution of the waiting time until a positive trading day. 
The expected counts based on the geometric model are shown in the last 
row. To find each expected count, we identify the probability of waiting D
days based on the geometric model ($P(D) = (1 - 0.529)^{D-1}(0.529)$) and
multiply by the total number of streaks, 2587. For example, waiting for
three days occurs under the geometric model about $0.471^2 \times 0.529 = 11.7\%$
of the time, which corresponds to about $0.117 \times 2587 \approx 304$ streaks.



[Bar plot. Legend: Observed, Expected. Horizontal axis: Wait until positive day (1 to 7+).]



Figure 5.22: Side-by-side bar plot of the observed and expected counts for 
each waiting time. 
counts from the Geom. Model row. The method for computing expected counts is discussed
in Table 5.21. In general, the expected counts are determined by (1) identifying the null 
proportion associated with each bin, then (2) multiplying each null proportion by the total 
count to obtain the expected counts. That is, this strategy identifies what proportion of 
the total count we would expect to be in each bin. 
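
The two-step recipe above can be sketched in Python for the geometric model. This is a hedged illustration: the variable names are assumptions, and because it uses the rounded proportion 0.529, the results may differ from the Geom. Model row of Table 5.21 by about one count.

```python
p_up = 0.529      # proportion of Up days reported in the text
total = 2587      # total number of waiting times in Table 5.21
max_bin = 7       # waits of 7 or more days share a single bin

# Step 1: null proportion for waits of exactly 1..6 days, (1 - p)^(d - 1) * p.
proportions = [(1 - p_up) ** (d - 1) * p_up for d in range(1, max_bin)]
# The "7+" bin takes the remaining probability, (1 - p)^6.
proportions.append((1 - p_up) ** (max_bin - 1))

# Step 2: multiply each null proportion by the total count.
expected = [prop * total for prop in proportions]

labels = [str(d) for d in range(1, max_bin)] + ["7+"]
for label, e in zip(labels, expected):
    print(f"{label:>2} days: {e:7.1f}")
```

Binning the tail as "7+" keeps the null proportions summing to 1, so the expected counts add up to the same total as the observed counts.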

Exercise 5.43 Do you notice any unusually large deviations in the graph? Can
you tell if these deviations are due to chance just by looking?

It is not obvious whether differences in the observed counts and the expected counts 
from the geometric distribution are significantly different. That is, it is not clear whether 
these deviations might be due to chance or whether they are so strong that the data provide 
convincing evidence against the null hypothesis. However, we can perform a chi-square test 
using the counts in Table 5.21. 

Exercise 5.44 Table 5.21 provides a set of count data for waiting times ($O_1 = 1298$,
$O_2 = 685$, ...) and expected counts under the geometric distribution ($E_1 = 1368$,
$E_2 = 644$, ...). Compute the chi-square test statistic, $X^2$. Answer in footnote 29.

Exercise 5.45 Because the expected counts are all at least 10, we can safely apply
the chi-square distribution to $X^2$. However, how many degrees of freedom should we
use? Hint: How many groups are there? Solution in footnote 30.

Example 5.46 If the observed counts follow the geometric model, then the chi-square
test statistic $X^2 = 24.43$ would closely follow a chi-square distribution with
df = 6. Using this information, compute a p-value.

Figure 5.23 shows the chi-square distribution, cutoff, and the shaded p-value. If we
look up the statistic $X^2 = 24.43$ in Appendix C.3, we find that the p-value is less
than 0.001. In other words, we have very strong evidence to reject the notion that
the wait times follow a geometric distribution, i.e. trading days are not independent
and past days may help predict what the stock market will do today.
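
As a rough check of the table lookup, the statistic and its p-value can be computed directly. The sketch below uses only the counts in Table 5.21; the helper `chi2_upper_tail_even_df` is a hypothetical name, and it relies on a closed form for the chi-square upper tail that holds only when the degrees of freedom are even (as with df = 6 here). A statistics library would normally be used instead.

```python
import math

observed = [1298, 685, 367, 157, 77, 33, 20]   # Observed row, Table 5.21
expected = [1368, 644, 304, 143, 67, 32, 28]   # Geom. Model row, Table 5.21

# Sum of (observed - expected)^2 / expected over the seven bins.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

def chi2_upper_tail_even_df(x, df):
    """Upper-tail area of a chi-square distribution for even df.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only handles positive even df")
    half = x / 2
    terms = (half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * sum(terms)

p_value = chi2_upper_tail_even_df(chi_sq, df)
print(f"X^2 = {chi_sq:.2f}, df = {df}, p-value = {p_value:.5f}")
```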




Figure 5.23: Chi-square distribution with 6 degrees of freedom. The p-value
for the stock analysis is shaded; the area representing the p-value is very slim.

29 $X^2 = \frac{(1298-1368)^2}{1368} + \frac{(685-644)^2}{644} + \cdots + \frac{(20-28)^2}{28} = 24.43$

30 There are k = 7 groups, so we use df = k - 1 = 6.
Example 5.47 In Example 5.46, we rejected the null hypothesis that the trading
days are independent. Why is this so important? 

Because the data provided strong evidence that the geometric distribution is not 
appropriate, we reject the claim that trading days are independent. While it is not 
obvious how to exploit this information, it suggests there are some hidden patterns 
in the data that could be interesting and possibly useful to a stock trader. 

5.7 Testing for independence in two-way tables (special 
topic) 

Google is constantly running experiments to test new search algorithms. For example, 
Google might test three algorithms using a sample of 10,000 google.com search queries. 
Table 5.24 shows the 10,000 queries split into three algorithm groups 31 . The group sizes 
were specified before the start of the experiment to be 5000 for the current algorithm and 
2500 for each test algorithm. 



Table 5.24: Google experiment breakdown of test subjects into three search 
groups. 

Example 5.48 What is the ultimate goal of the Google experiment? What are the
null and alternative hypotheses, in regular words?

The ultimate goal is to see whether there is a difference in the performance of the
algorithms. The hypotheses can be described as the following:

$H_0$: The algorithms each perform equally well.
$H_A$: The algorithms do not perform equally well.

In this experiment, the explanatory variable is the search algorithm. However, an 
outcome variable is also needed. This outcome variable should somehow reflect whether 
the search results align with the user's interests. One possible way to quantify this is to 
determine whether (1) the user clicked one of the links provided and did not try a new 
search, or (2) the user performed a related search. Under scenario (1), we might think 
that the user was satisfied with the search results. Under scenario (2), the search results 
probably were not relevant, so the user tried a second search. 

Table 5.25 provides the results from the experiment. These data are very similar to 
the count data in Section 5.6. However, now the different combinations of two variables 
are binned in a two-way table. In examining these data, we want to evaluate whether there 
is strong evidence that at least one algorithm is performing better than the others. To do 
so, we apply the chi-square test to this two-way table. The ideas of this test are similar to 
those ideas in the one-way table case. However, degrees of freedom and expected counts 
are computed a little differently than before. 

31 Google regularly runs experiments in this manner to help improve their search engine. It is entirely 
possible that if you perform a search and so does your friend, that you will have different search results. 
While the data presented in this section resemble what might be encountered in a real experiment, these 
data are simulated. 



Search algorithm   current   test 1   test 2   Total
Counts                5000     2500     2500   10000

                 Search algorithm
                 current   test 1   test 2    Total
No new search       3511     1749     1818     7078
New search          1489      751      682     2922
Total               5000     2500     2500    10000


Table 5.25: Results of the Google search algorithm experiment. 



What is so different about one-way tables and two-way tables? 

A one-way table describes counts for each outcome in a single variable. A two-way 
table describes counts for combinations of outcomes for two variables. When we 
consider a two-way table, we often would like to know, are these variables related 
in any way? That is, are they dependent (versus independent)? 



The hypothesis test for this Google experiment is really about assessing whether there 
is statistically significant evidence that the choice of the algorithm affects whether a user 
performs a second search. In other words, the goal is to check whether the search variable 
is independent of the algorithm variable. 

5.7.1 Expected counts in two-way tables 

Example 5.49 From the experiment, we estimate the proportion of users who were
satisfied with their initial search (no new search) as 7078/10000 = 0.7078. If there 
really is no difference among the algorithms and 70.78% of people are satisfied with 
the search results, how many of the 5000 folks in the current algorithm group would 
be expected to not perform a new search? 

About 70.78% of the 5000 would be satisfied with the initial search: 

0.7078 * 5000 = 3539 users 

That is, if there was no difference between the three groups, then we would expect 
3539 of the current algorithm users not to perform a new search. 

Exercise 5.50 Using the same rationale described in Example 5.49, about how
many users in each of the test groups would not perform a new search if the algorithms
were equally helpful? Short answer in footnote 32.

Example 5.51 If 3539 of the 5000 users who are given the current algorithm are
not expected to perform a new search, how many would we expect to perform a new
search?

We would expect the other 1461 users in the "current" group to perform a new
search.

Exercise 5.52 If 1769.5 of the 2500 users in each test group are not expected to
perform a new search, how many would be expected to perform a new search in each
of these groups? Answer in footnote 33.



32 About 1769.5. It is okay that this is a fraction. 
33 2500 - 1769.5 = 730.5. 
The expected counts from Examples 5.49 and 5.51 and Exercises 5.50 and 5.52 were
used to construct Table 5.26. This is the same as Table 5.25, except now the expected 
counts have been added in parentheses. 



                 Search algorithm
                 current        test 1          test 2          Total
No new search    3511 (3539)    1749 (1769.5)   1818 (1769.5)    7078
New search       1489 (1461)     751 (730.5)     682 (730.5)     2922
Total            5000           2500            2500            10000



Table 5.26: The observed counts and the (expected counts). 
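
The expected counts in parentheses can be reproduced from the totals of Table 5.25 alone. The following Python sketch applies the reasoning of Examples 5.49 and 5.51: take the overall proportion in each row and multiply it by each column total. The dictionary layout and variable names are assumptions chosen for illustration.

```python
# Observed counts from Table 5.25.
observed = {
    "No new search": {"current": 3511, "test 1": 1749, "test 2": 1818},
    "New search":    {"current": 1489, "test 1": 751,  "test 2": 682},
}
columns = ["current", "test 1", "test 2"]

row_totals = {r: sum(row.values()) for r, row in observed.items()}
col_totals = {c: sum(row[c] for row in observed.values()) for c in columns}
table_total = sum(row_totals.values())

# Expected count = (overall proportion in the row) * (column total).
expected = {
    r: {c: row_totals[r] / table_total * col_totals[c] for c in columns}
    for r in observed
}

for r in observed:
    cells = ", ".join(f"{c}: {expected[r][c]:.1f}" for c in columns)
    print(f"{r}: {cells}")
```

Fractional expected counts such as 1769.5 are kept as they are; rounding them is not needed for the test.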



The examples and exercises above provided some help in computing expected counts. 
In general, expected counts for a two-way table may be computed using only the row 
totals, column totals, and the table total. For instance, if there was no difference between 
the groups, then about 70.78% of each column should be in the first row: 



0.7078 * (column 1 total) = 3539 
0.7078 * (column 2 total) = 1769.5 
0.7078 * (column 3 total) = 1769.5 



Looking back to how the fraction 0.7078 was computed - as the fraction of users who did 
not perform a new search (7078/10000) - these three expected counts could have been 
computed as 



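$$\left(\frac{\text{row 1 total}}{\text{table total}}\right)(\text{column 1 total}) = 3539$$

$$\left(\frac{\text{row 1 total}}{\text{table total}}\right)(\text{column 2 total}) = 1769.5$$

$$\left(\frac{\text{row 1 total}}{\text{table total}}\right)(\text{column 3 total}) = 1769.5$$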



This leads us to a general formula for computing expected counts in a two-way table when 
we would like to test whether there is strong evidence of an association between the column 
variable and row variable. 



Computing expected counts in a two-way table 

To identify the expected count for the i-th row and j-th column, compute

$$\text{Expected Count}_{\text{row } i,\ \text{col } j} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{table total}}$$

5.7.2 The chi-square test statistic for two-way tables

The chi-square test statistic for a two-way table is found the same way it is found for a
one-way table. For each table count, compute

General formula: $\frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$

Row 1, Col 1: $\frac{(3511 - 3539)^2}{3539} = 0.222$

Row 1, Col 2: $\frac{(1749 - 1769.5)^2}{1769.5} = 0.237$

$\vdots$

Row 2, Col 3: $\frac{(682 - 730.5)^2}{730.5} = 3.220$

Adding the computed value for each cell gives the chi-square test statistic $X^2$:

$X^2 = 0.222 + 0.237 + \cdots + 3.220 = 6.120$

Just like before, this test statistic follows a chi-square distribution. However, the degrees
of freedom are computed a little differently for a two-way table (see footnote 34). For two-way
tables, the degrees of freedom are equal to

df = (number of rows minus 1) * (number of columns minus 1)

In our example, the degrees of freedom parameter is

df = (2 - 1) * (3 - 1) = 2

If the null hypothesis is true (i.e. the algorithms are equally useful), then the test statistic
$X^2 = 6.12$ closely follows a chi-square distribution with 2 degrees of freedom. Using this
information, we can compute the p-value for the test, which is depicted in Figure 5.27.



Computing degrees of freedom for a two-way table 

When applying the chi-square test to a two-way table, we use 

$df = (R - 1) \times (C - 1)$
where R is the number of rows in the table and C is the number of columns. 



Example 5.53 Compute the p-value and draw a conclusion about whether the
search algorithms have different performances.

Looking in Appendix C.3 on page 366, we examine the row corresponding to 2 degrees
of freedom. The test statistic, $X^2 = 6.120$, falls between the fourth and fifth columns,
which means the p-value is between 0.02 and 0.05. Because we typically test at a
significance level of $\alpha = 0.05$ and the p-value is less than 0.05, the null hypothesis is
rejected. That is, the data provide convincing evidence that there is some difference
in performance among the algorithms.
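
The whole test for the Google example can be sketched in a few lines of Python, using the observed counts from Table 5.25. This is an illustrative sketch rather than a general-purpose routine: the variable names are assumptions, and the p-value line uses the fact that for df = 2 the chi-square upper-tail area equals exp(-x/2), which does not apply to other degrees of freedom.

```python
import math

# Observed counts from Table 5.25 (columns: current, test 1, test 2).
observed = [
    [3511, 1749, 1818],   # no new search
    [1489, 751, 682],     # new search
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected_count = row_totals[i] * col_totals[j] / table_total
        chi_sq += (obs - expected_count) ** 2 / expected_count

df = (len(observed) - 1) * (len(observed[0]) - 1)
if df != 2:
    raise ValueError("the closed-form p-value below assumes df = 2")

# Upper-tail area of a chi-square distribution with 2 degrees of freedom.
p_value = math.exp(-chi_sq / 2)

print(f"X^2 = {chi_sq:.3f}, df = {df}, p-value = {p_value:.4f}")
```

The computed p-value should land inside the interval read from the table, between 0.02 and 0.05.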



34 Recall: in the one-way table, the degrees of freedom was the number of cells minus 1.



Figure 5.27: Computing the p- value for the Google hypothesis test. 





              Obama   Dem. leaders   Rep. leaders   Total
Approve         683            465            375    1523
Disapprove      639            855            885    2379
Total          1322           1320           1260    3902



Table 5.28: Pew Research poll results of a March 2010 poll. 



Example 5.54 Table 5.28 summarizes the results of a Pew Research poll (see footnote 35). We
would like to determine if there are actually differences in the approval ratings of
Barack Obama, the Democratic leaders, and the Republican leaders. What are
appropriate hypotheses for such a test?

$H_0$: There is no difference in approval ratings between the three groups.

$H_A$: There is some difference in approval ratings between the three groups, e.g.
perhaps Obama's approval differs from other Democratic leaders.

Exercise 5.55 A chi-square test for a two-way table may be used to test the
hypotheses in Example 5.54. As a first step, compute the expected values for each of
the six table cells. The computations for the expected counts in the cells of the first
column are shown in footnote 36.

Exercise 5.56 Compute the chi-square test statistic. Solution in footnote 37.

Exercise 5.57 Because there are 2 rows and 3 columns, the degrees of freedom for
the test is df = (2 - 1) * (3 - 1) = 2. Use $X^2 = 142.2$, df = 2, and the chi-square
table (page 366) to evaluate whether to reject the null hypothesis. Answer in
footnote 38.

35 See the Pew Research website: http://people-press.org/report/598/healthcare-reform

36 The expected count for row one / column one is found by multiplying the row one total (1523) and
column one total (1322), then dividing by the table total (3902): $\frac{1523 \times 1322}{3902} = 516.0$.
Similarly for the first column and the second row: $\frac{2379 \times 1322}{3902} = 806.0$.

37 For each cell, compute $\frac{(\text{obs} - \text{exp})^2}{\text{exp}}$. For instance, the first row and first column: $\frac{(683 - 516)^2}{516} = 54.0$.
Adding the results of each cell gives the chi-square test statistic: $X^2 = 54.0 + \cdots + 17.8 = 142.2$.

38 The test statistic is larger than the right-most column of the df = 2 row of the chi-square probability
table, meaning the p-value is less than 0.001. That is, we reject the null hypothesis because the p-value is
less than 0.05, and we conclude that Americans' approval differs among the party leaders and the
president.
5.8 Exercises
5.8.1 Paired data 

5.1 Is there strong evidence that the continental U.S. is warming? We might take a 
simple approach to this problem and compare how temperatures have changed from 1968 
to 2008. The daily high temperature reading on January 1 was collected for 1968 and 2008 
for 51 randomly selected locations in the continental U.S. Then the difference between the 
two readings (temperature in 2008 - temperature in 1968) was calculated for each of the 
51 different locations. The average of these 51 values was 1.1 degrees with a standard 
deviation of 4.9 degrees. We are interested in finding whether these data provide strong 
evidence of temperature warming in the continental U.S. 



(a) Are the two data sets of 51 observations dependent or independent of each other? 
Based on this, what type of analysis should be conducted to test if these data provide 
strong evidence of temperature warming in the continental U.S? 

(b) Write the hypotheses in symbols and in words. 

(c) Calculate the test statistic and find the p-value. (Reminder: check assumptions and 
conditions.) 

(d) What do you conclude? Interpret your conclusion in context. 

(e) What type of error might we have made? Explain in context what the error means. 

(f) Based on the result of this hypothesis test, would you expect a confidence interval for 
the average difference between the temperature measurements from 1968 and 2008 to 
include 0? Explain your reasoning. 



5.2 In Exercise 5.1, we considered the differences between the temperature readings
on January 1 of 1968 and 2008 at 51 locations in the continental U.S.

(a) Calculate a 90% confidence interval for the average difference between the temperature
measurements from 1968 and 2008.

(b) Interpret this interval in context. 

(c) Does this interval agree with the conclusion of the hypothesis test from Exercise 5.1? 
Explain. 
5.8.2 Difference of two means

5.3 In 1964 the Office of the Surgeon General released their first report linking smoking 
to various health issues, including cancer. Research done by an ad agency surveyed the 
number of cigarettes smoked by 80 smokers the day before the Surgeon General's report
came out. The sample average was 13.5 and the standard deviation was 3.2. A year after 
the report was released, in a random sample of 85 smokers, the average number of cigarettes 
smoked per day was 12.6 with a standard deviation of 2.9. Is there strong evidence that 
the average number of cigarettes smoked per day decreased after the Surgeon General's 
report? 





             Before   After
n                80      85
$\bar{x}$      13.5    12.6
s               3.2     2.9



(a) Write the hypotheses in symbols and in words. 

(b) Calculate the test statistic and find the p-value. (Reminder: check assumptions and 
conditions.) 

(c) What do you conclude? Interpret your conclusion in context. 

(d) Does this imply that the Surgeon General's report was the cause of this decrease? 
Explain. 

(e) What type of error might we have made? Explain. 

5.4 Based on the data given in Exercise 5.3, construct a 90% confidence interval for the 
difference between the average number of cigarettes smoked per day before and after the 
Surgeon General's report was released. Interpret this interval in context. Also comment on 
if the confidence interval agrees with the conclusion of the hypothesis test from Exercise 5.3. 

5.5 The National Assessment of Educational Progress tested 13 year old students in 
reading in 2004 and 2008. A random sample of 1,000 students who took the test in 2004 
yielded an average score of 257 with a standard deviation of 39. A random sample of 
1,000 students who took the test in 2008 yielded an average score of 260 with a standard 
deviation of 38. Construct a 90% confidence interval for the difference between the average 
scores in 2004 and 2008. Interpret this interval in context. (Reminder: check assumptions 
and conditions.) [23] 

5.6 Exercise 5.5 provides data on the average reading scores from tests conducted by the
National Assessment of Educational Progress in 2004 and 2008.

(a) Do these data provide strong evidence that the average reading score for 13 year
old students has changed between 2004 and 2008? Use a 10% significance level.

(b) What type of error might we have made? Explain. 

(c) Does the conclusion of your hypothesis test agree with the confidence interval from 
Exercise 5.5? 
5.7 Women are said to be obese if they have greater than 35% body fat, while the cutoff
for men being termed obese is 25%. The cutoff for women is higher since women store 
extra fat in their bodies to be used during childbearing. The third National Health and 
Nutrition Examination Survey collected body fat percentage (BF) and lean mass data from 
13,601 subjects ages 20 to 79.9. A summary table for these data is given below. Note that 
BF and lean mass are given as mean ± standard error. Test the hypothesis that women on 
average have a higher body fat percentage than men using these data. You may assume 
that all assumptions and conditions for inference are satisfied. [24] 



Gender      n      BF (%)        Lean mass (kg)
Men      6580   23.9 ± 0.07     61.8 ± 0.12
Women    7021   35.0 ± 0.09     44.0 ± 0.08



Test the hypothesis that women have higher average body fat percentages than men using 
a 1% significance level. 

5.8 Exercise 5.7 also provides information on the amount of lean mass of men and women 
who were surveyed. Calculate a 95% confidence interval for the difference between the lean 
mass amounts of men and women, and interpret the interval in context. Also comment on 
whether or not this interval suggests a significant difference between the average lean mass 
amounts of men and women. 

5.8.3 Single population proportion 

5.9 Suppose that 8% of college students are vegetarians. Determine if the following 
statements are true or false, and explain your reasoning. 

(a) The distribution of the sample proportions of vegetarians in random samples of size 60 
is nearly normal since n > 50. 

(b) The distribution of the sample proportions of vegetarian college students in random 
samples of size 50 is right skewed. 

(c) A random sample of 125 college students where 12% are vegetarians would be consid- 
ered unusual. 

(d) A random sample of 250 college students where 12% are vegetarians would be consid- 
ered unusual. 

(e) The standard error would be reduced by one-half if we increased the sample size from 
125 to 250. 

5.10 Suppose that the proportion of the adult population who jog is 0.15. Determine if 
the following statements are true or false, and explain your reasoning. 

(a) The distribution of the proportions of joggers in random samples of size 40 is right 
skewed. 

(b) The distribution of the proportions of joggers in random samples of size 80 is nearly 
normal since n > 50. 

(c) A random sample of 150 where 20% are joggers would be considered unusual. 

(d) A random sample of 300 where 20% are joggers would be considered unusual. 

(e) The standard error would be reduced by one-half if we increased the sample size from 
150 to 600. 
5.11 Ninety percent of orange tabby cats are male. Determine if the following statements
are true or false, and explain your reasoning. 

(a) The distribution of sample proportions of samples of size 30 is left skewed. 

(b) Doubling the sample size will reduce the standard error of the sample proportion by 
one-half. 

(c) The distribution of sample proportions of samples of size 140 is approximately normal. 

(d) Doubling the sample size will reduce the standard error of the sample proportion by a
factor of $\sqrt{2}$.

(e) The distribution of sample proportions of samples of size 70 is approximately normal. 

5.12 In a poll conducted by Survey USA on July 12, 2010, 70% of the 119 respondents 
between the ages of 18 and 34 said they would vote in the 2010 general election for Prop 
19, which would change California law to legalize marijuana and allow it to be regulated 
and taxed. At a 95% confidence level, this sample has an 8% margin of error. Based on 
this information, determine if the following statements are true or false, and explain your 
reasoning. 

(a) We are 95% confident that between 62% and 78% of the California voters in this sample
support Prop 19.

(b) We are 95% confident that between 62% and 78% of all California voters between the 
ages of 18 and 34 support Prop 19. 

(c) If we considered many random samples of 119 California voters between the ages of 18 
and 34, and we calculated the sample proportions of those who support Prop 19, 95% 
of them will be between 62% and 78%. 

(d) In order to decrease the margin of error to 4%, we would need to quadruple (multiply 
by 4) the sample size. 

(e) Based on this confidence interval, there is sufficient evidence to conclude that a majority 
of California voters between the ages of 18 and 34 support Prop 19. 

5.13 We are interested in estimating the proportion of graduates at a mid-sized university 
who found a job within one year of completing their undergraduate degree. We conduct 
a survey and find out that 348 of the 400 randomly sampled graduates found jobs. The 
graduating class under consideration included approximately 4500 students. 

(a) Describe the population parameter of interest. What is the value of the point estimate 
of this parameter? 

(b) Construct a 95% confidence interval for the proportion of graduates who found a job 
within one year of completing their undergraduate degree at this university. (Reminder: 
check assumptions and conditions.) 

(c) Explain what this interval means in the context of this question. 

(d) What does "95% confidence" mean? 

(e) Without doing any calculations, describe what would happen to the confidence interval 
if we decided to use a higher confidence level. 

(f) Without doing any calculations, describe what would happen to the confidence interval
if we used a larger sample. 
5.14 After implementing a study abroad program, a university conducted a study to find
out what percent of students had already traveled to another country. The survey showed 
that 42 out of 100 sampled students had previously visited abroad. 

(a) Describe the population parameter of interest. What is the value of the point estimate 
of this parameter? 

(b) Construct a 90% confidence interval for the proportion of students at this university 
who have traveled abroad. (Reminder: check assumptions and conditions.) 

(c) Interpret this interval in context. 

(d) What does "90% confidence" mean? 

5.15 It is believed that large doses of acetaminophen (the active ingredient in over the 
counter pain relievers like Tylenol) may cause damage to the liver. A researcher wants to 
conduct a study to estimate the proportion of acetaminophen users who have liver damage. 
For participating in this study, she will pay each subject $20 and provide a free medical 
consultation if the patient has liver damage. 

(a) If she wants to limit the margin of error of her 98% confidence interval to 2%, what is 
the minimum amount of money she needs to set aside to pay her subjects? 

(b) The amount you calculated in part (a) is substantially over her budget so she decides
to use fewer subjects. How will this affect the width of her confidence interval?

5.16 We are interested in estimating the proportion of students at a university who smoke. 
Out of a random sample of 200 students from this university, 40 students smoke. 

(a) Construct a 95% confidence interval for the proportion of students at this university 
who smoke, and interpret this interval in context. 

(b) Construct a 99% confidence interval for the proportion of students at this university 
who do not smoke, and interpret this interval in context. 

(c) Compare the widths of the 95% and 99% confidence intervals. Which one is wider? 
Can you explain why this is the case? 

(d) If we wanted the margin of error to be no larger than 2% at a 95% confidence level
for the proportion of students who smoke, how big of a sample would we need?

5.17 In January 2011, The Marist Poll published a report stating that 66% of adults 
nationally think licensed drivers should be required to re-take their road test once they reach 
65 years of age. It was also reported that interviews were conducted on 1,018 American 
adults, and that the margin of error was 3% using a 95% confidence level. [25] 

(a) Verify the margin of error reported by The Marist Poll. The data collected was based 
on simple random sampling. 

(b) Does the poll contain strong evidence that more than two thirds of the population 
think that licensed drivers should be required to re-take their road test once they turn 
65? 
5.18 Exercise 5.14 provides the result of a campus wide survey where 42 of 100 randomly
sampled students have traveled abroad. A comprehensive follow-up survey was conducted 
one year after the study abroad program was implemented. It sampled all students at the 
university and found that 51% of the university's students had traveled to another country. 
Do you think there has been an increase in the number of students who traveled abroad 
since the first survey? Conduct a hypothesis test to check. Would you come to the same 
conclusion using your confidence interval from Exercise 5.14? 

5.19 A national survey conducted in 2011 among a simple random sample of 1,507 adults 
shows that 56% of Americans think the Civil War is still relevant to American politics and 
political life. [26] 

(a) Conduct a hypothesis test to determine if these data provide strong evidence that
a majority of Americans think the Civil War is still relevant.

(b) Interpret the p-value in context.

(c) Calculate a 90% confidence interval for the proportion of Americans who think the 
Civil War is still relevant. Interpret the interval in context and comment on whether 
or not the confidence interval agrees with the conclusion of the hypothesis test. 

5.20 A college review magazine states that in many business schools there is a certain 
stigma that marketing is a less stressful major and so most students (>50%) majoring in 
marketing also major in finance, economics or accounting to be able to show employers that 
their quantitative skills are also strong. In order to test this claim, an education researcher 
collects a simple random sample of 80 undergraduate students majoring in marketing at 
various business schools and finds that 50 of them have a double major. 

(a) Conduct a hypothesis test to determine if these data provide strong evidence supporting
this magazine's claim that a majority of marketing students have a double major.

(b) Interpret the p-value in context.

(c) Calculate a 90% confidence interval for the proportion of marketing students who have 
a double major. Interpret the interval in context and comment on whether or not the 
confidence interval agrees with the conclusion of the hypothesis test. 

5.21 Among a simple random sample of 331 American adults who do not have a four-year 
college degree and are not currently enrolled in school, 48% said they decided to not go to 
college because they could not afford school. [27] 

(a) A newspaper article states that only a minority of the Americans who decide not to 
go to college do so because they cannot afford it, and uses the point estimate from 
this survey as evidence. Conduct a hypothesis test to determine if these data provide 
strong evidence supporting this statement. Use a 10% significance level. 



(b) Would you expect a confidence interval for the proportion of American adults who 
decide to not go to college because they cannot afford it to include 0.5? Explain. 
5.22 A Washington Post article reports that "support for a government-run health-care
plan to compete with private insurers has rebounded from its summertime lows and wins
clear majority support from the public." More specifically, the article says "seven in 10
Democrats back the plan, while almost nine in 10 Republicans oppose it. Independents
divide 52 percent against, 42 percent in favor of the legislation." There were 819
Democrats, 566 Republicans and 783 Independents surveyed. [28]

(a) A political pundit on TV claims that a majority of Independents oppose the public 
option health care plan. Do these data provide strong evidence to support this state- 
ment? 

(b) Would you expect a confidence interval for the proportion of Independents who oppose 
the public option plan to include 0.5? Explain. 

5.23 Exercise 5.21 presents the results of a poll where 48% of 331 Americans who decide 
to not go to college do so because they cannot afford it. 

(a) Construct an 80% confidence interval for the proportion of Americans who decide to not 
go to college because they cannot afford it. Interpret it in context and comment 
on whether or not the confidence interval agrees with the conclusion of the hypothesis 
test from Exercise 5.21. 

(b) Suppose we wanted the margin of error for the 80% confidence level to be about 1.5%. 
How large of a survey would you recommend? 

5.24 Exercise 5.22 presents the results of a poll evaluating support for the public option 
health care plan in 2009. 

(a) Construct a 90% confidence interval for the proportion of Independents who support 
the public option health care plan. Interpret it in context and comment on whether 
or not the confidence interval agrees with the conclusion of the hypothesis test from 
Exercise 5.22. 

(b) If we wanted to limit the margin of error of a 90% confidence interval to 1%, about 
how many Independents would we need to survey? 

5.25 A report by the Centers for Disease Control and Prevention states that 30% of 
Americans are habitually getting less than six hours of sleep a night - less than the rec- 
ommended seven to nine hours. New York is known as "the city that never sleeps". In a 
simple random sample of 300 New Yorkers, it was found that 105 of them get less than six 
hours of sleep a night. 

(a) Do these data provide strong evidence that the rate of sleep deprivation for New Yorkers 
is higher than the rate of sleep deprivation in the population at large? 

(b) Interpret the p-value in context. 

5.26 Some people claim that they can tell the difference between a diet soda and a regular 
soda in the first sip. A researcher wanting to test this claim randomly sampled 80 such 
people. He then filled 80 plain white cups with soda, half diet and half regular through 
random assignment, and asked each person to take one sip from their cup and identify the 
soda as "diet" or "regular". 53 participants correctly identified the soda. 

(a) Do these data provide strong evidence that these people are able to detect the difference 
between diet and regular soda, in other words, are the results significantly better than 
just random guessing? 

(b) Interpret the p-value in context. 

5.27 A large survey conducted five years ago at a university showed that 18% of the 
university students smoked. A more recent survey (simple random sample) found that 40 
of 200 students at the university smoked. 

(a) Do the data provide strong evidence that the percentage of students who smoke has 
changed over the last five years? 

(b) What type of error might we have made? 

5.28 The corporate management at a paper company has reason to believe one of its 
(supposed) star managers, Michael, is making misleading claims about his regional market 
share. Michael recently claimed that his office is the sole paper provider to 45% of its 
region's businesses. When the senior management conducted its own survey, they found 
that 36% of 180 randomly sampled businesses purchased their paper from Michael's office. 

(a) Does this provide strong evidence that Michael is misleading the upper management 
about his office's performance? Conduct a full hypothesis test. 

(b) What type of error might we have made? 

5.29 Statistics show that traditionally about 65% of students in a particular rural school 
district go out of state for college. The school board would like to see if this number will 
increase next year. To check, they randomly sample 250 college-bound high school students 
and discover 172 of these students say they will be going to school out of state. 

(a) Do these data provide strong evidence that the percentage of students in this rural 
school district who go out of state for college has increased? 

(b) Interpret the p-value in context. 

5.30 A law firm associate interviewed 300 randomly selected residents from counties where 
it is believed mining pollution has caused elevated mercury levels in the water supply. She 
finds that 38 individuals have higher blood levels of mercury than what the EPA accepts 
as reasonable; only about 7% of the general population has such high levels of mercury. 

(a) Does this provide strong evidence that individuals in these counties are at greater risk 
of having elevated mercury levels? 



(b) Can the result of your hypothesis test be used to prove that mining pollution has caused 
elevated mercury levels in the water supply? 

5.8.4 Difference of two proportions 

5.31 In Exercise 5.29, we were introduced to data from a rural school district where 172 
out of 250 college-bound high school students planned to go out of state for college. In 
a similar survey conducted in an urban school district, 450 out of 930 randomly sampled 
college-bound high school students planned to go out of state. 

(a) Calculate a 95% confidence interval for the difference between the proportions of high 
school students from the rural and the urban district who go out of state for college. 
(Reminder: check conditions and assumptions.) 

(b) Interpret the confidence interval and describe its practical implications. 

5.32 Exercise 5.31 presents the results of two surveys conducted in a rural and an urban 
school district. 

(a) Conduct a hypothesis test to determine if there is strong evidence to suggest that a 
higher proportion of students from the rural district go out of state for college. 

(b) Note any changes you would make to your setup, calculations, and conclusion in part 
(a) if you were looking for any difference rather than using a one-sided test. 

(c) Which of the tests do you think is most appropriate? Explain. 

5.33 Exercise 5.22 presents the results of a poll evaluating support for the public option 
health care plan in 2009. 70% of 819 Democrats and 42% of 783 Independents support the 
public option. 

(a) Do these data provide strong evidence that a higher proportion of Democrats than 
Independents support the public option plan? 

(b) What type of error might we have made? 

(c) Would you expect a confidence interval for the difference between the two proportions 
to include 0? Explain your reasoning. If you answered no, would you expect the 
confidence interval for (p_D - p_I) to be positive or negative? 

(d) Calculate a 95% confidence interval for the difference (p_D - p_I) and interpret 
it in context. 

(e) True or false: If we had picked a random Democrat and a random Independent at the 
time of this poll, it is more likely that the Democrat would support the public option 
than the Independent. 

5.34 According to a report on sleep deprivation by the Centers for Disease Control and 
Prevention, the proportion of California residents who reported insufficient rest or sleep 
during each of the preceding 30 days is 8.0%, while this proportion is 8.8% for Oregon 
residents. These data are based on simple random samples of 11,545 California and 4,691 
Oregon residents. [29] 

(a) What kind of study is this? 

(b) Conduct a hypothesis test to determine if these data provide strong evidence the rate 
of sleep deprivation is different for the two states. 

(c) Explain what type of error we might have made in this hypothesis test. 

(d) Would you expect a confidence interval for the difference between the two proportions 
to include 0? Explain your reasoning. 

5.35 Using the data provided in Exercise 5.34, construct and interpret a 95% confidence 
interval for the difference between the population proportions. If we had instead conducted 
a hypothesis test to check whether the proportions were equal, what conclusion would your 
confidence interval support? 

5.36 A study published in 2001 asked 1924 male and 3666 female undergraduate college 
students their favorite color. A 95% confidence interval for the difference between the pro- 
portions of males and females whose favorite color is black (p_male - p_female) was calculated 
to be (0.02, 0.06). Based on this information, determine if the following statements are 
true or false, and explain your reasoning. [30] 

(a) We are 95% confident that the true proportion of males whose favorite color is black 
is 2% lower to 6% higher than the true proportion of females whose favorite color is 
black. 

(b) We are 95% confident that the true proportion of males whose favorite color is black 
is 2% to 6% higher than the true proportion of females whose favorite color is black. 

(c) 95% of random samples will produce 95% confidence intervals that include the true 
difference between the population proportions of males and females whose favorite 
color is black. 

(d) We can conclude that there is a significant difference between the proportions of males 
and females whose favorite color is black and that the difference between the two sample 
proportions was too large to plausibly be due to chance. 

(e) The 95% confidence interval for (p_female - p_male) cannot be calculated with only the 
information given in this exercise. 

5.37 Researchers studying the effectiveness of an anti-anxiety medication randomly sam- 
pled 100 patients diagnosed with general anxiety disorder (GAD). Through random assign- 
ment, half of the patients received a placebo and the other half received the anti-anxiety 
medication. The treatment was considered a success if the patient experienced a reduction 
in their anxiety level. A 95% confidence interval for the difference between the proportion 
of success in the two groups (p_medication - p_placebo) was (-0.02, 0.24). Based on this infor- 
mation, determine if the following statements are true or false, and explain your reasoning. 

(a) We are 95% confident that the true proportion of success among the population of 
people who take the medication is 2% lower to 24% higher than the true proportion of 
success among the population of people who take a placebo. 

(b) 95% of random samples will produce a difference between -0.02 and 0.24. 

(c) The 95% confidence interval for (p_placebo - p_medication) cannot be calculated with only 
the information given in this exercise. 

(d) The 95% confidence interval for (p_placebo - p_medication) would be (-0.24, 0.02). 

(e) We can conclude that the medication was effective and the difference between the two 
sample proportions was too large to plausibly be due to chance. 

5.38 A 2010 survey asked 827 randomly sampled registered voters in California "Do you 
support? Or do you oppose? Drilling for oil and natural gas off the Coast of California? Or 
do you not know enough to say?" Below is the distribution of responses, separated based 
on whether or not the respondent graduated from college. [31] 

(a) What percent of college graduates and what per- 
cent of the non-college graduates in this sample do 
not know enough to have an opinion on drilling for 
oil and natural gas off the Coast of California? 

(b) Conduct a hypothesis test to determine if the data 
provide strong evidence that the proportion of col- 
lege graduates who do not have an opinion on this 
issue is different than that of non-college gradu- 
ates. 

5.39 Exercise 5.38 presents the results of a poll evaluating support for drilling for oil and 
natural gas off the coast of California. 

(a) What percent of college graduates and what percent of the non-college graduates in 
this sample support drilling for oil and natural gas off the Coast of California? 

(b) Conduct a hypothesis test to determine if the data provide strong evidence that the 
proportion of college graduates who support off-shore drilling in California is different 
than that of non-college graduates. 

5.40 A news article reports that "Americans have differing views on two potentially 
inconvenient and invasive practices that airports could implement to uncover potential 
terrorist attacks." This news piece was based on a survey conducted among a random 
sample of 1,137 adults nationwide, interviewed by telephone November 7-10, 2010, where 
one of the questions on the survey was "Some airports are now using 'full-body' digital 
x-ray machines to electronically screen passengers in airport security lines. Do you think 
these new x-ray machines should or should not be used at airports?" Below is a summary 
of responses based on party affiliation. [32] 

                                   Party Affiliation 
                              Republican  Democrat  Independent 
Answer  Should                       264       299          351 
        Should not                    38        55           77 
        Don't know/No answer          16        15           22 
        Total                        318       369          450 

Conduct an appropriate hypothesis test evaluating whether there is a difference in the 
proportion of Republicans and Democrats who think the full-body scans should be applied 
in airports. After making your conclusion, describe what type of error we might have 
committed with this test. 

5.41 Exercise 5.40 presents the results of a poll on public opinion on the use of full body 
scans at airports. 

(a) Calculate and interpret a 90% confidence interval for the difference between the pro- 
portion of Republicans and Democrats who support full-body scans. 

(b) Does this prove that there is no difference in opinion on the use of full-body scans 
between Republicans and Democrats? Explain. 



(Table for Exercise 5.38) 

                 College Grad 
                 Yes     No 
Support          154    132 
Oppose           180    126 
Do not know      104    131 
Total            438    389 

5.8.5 When to retreat 

5.42 The Stanford University Heart Transplant Study was conducted to determine whether 
an experimental heart transplant program increased lifespan. Each patient entering the pro- 
gram was designated officially a heart transplant candidate, meaning that he was gravely ill 
and would most likely benefit from a new heart. Some patients got a transplant (treatment 
group) and some did not (control group). The table below displays how many patients 
survived and died in each group. [10] 





         control  treatment 
alive          4         24 
dead          30         45 



A hypothesis test would reject the conclusion that the survival rate is the same in each 
group, and so we might like to construct a confidence interval. Explain why we cannot 
construct such an interval using our large sample techniques. What might go wrong if we 
constructed the confidence interval despite this problem? 

5.8.6 Testing for goodness of fit using chi-square 

5.43 A professor using an open-source introductory statistics book predicts that 60% of 
the students will purchase a hard copy of the book, 25% will print it out from the web, 
and 15% will read it online. At the end of the semester she asks her students to complete 
a survey where they indicate what format of the book they used. Of the 126 students, 71 
said they bought a hard copy of the book, 30 said they printed it out from the web, and 
25 said they read it online. 

(a) State the hypotheses for testing if the professor's predictions were inaccurate. 

(b) How many students did the professor expect to buy the book, print the book, and read 
the book exclusively online? 

(c) This is an appropriate setting for a chi-square test. List the assumptions and conditions 
required for a test and verify they are satisfied (as is always necessary). 

(d) Calculate the chi-squared statistic, the degrees of freedom associated with it, and the 
p-value. 

(e) Based on the p-value calculated in part (d), what is the conclusion of the hypothesis 
test? Interpret your conclusion in context. 

5.44 A Gallup Poll released in December 2010 asked 1019 adults living in the Continental 
U.S. about their belief in the origin of humans. These results, along with results from 
a more comprehensive poll from 2001 (that we will assume to be exactly accurate), are 
summarized in the table below: [33] 

                                                            Year 
Response                                                2010    2001 
Humans evolved, with God guiding (1)                     38%     37% 
Humans evolved, but God had no part in process (2)       16%     12% 
God created humans in present form (3)                   40%     45% 
Other / No opinion (4)                                    6%      6% 

(a) Calculate the actual number of respondents in 2010 that fall in each response category. 

(b) State hypotheses for the following research question: have beliefs on the origin of human 
life changed since 2001? 

(c) Calculate the expected number of respondents in each category if the null hypothesis 
is true. 

(d) Conduct a chi-square test and state your conclusion (reminder: verify assumptions). 

5.8.7 Testing for independence in two-way tables 

5.45 Exercise 5.40 introduces data on views on full-body scans at airports and party 
affiliation. The differences in each political group may be due to chance. Complete the 
following computations under the null hypothesis of independence between an individual's 
party affiliation and her support of full-body scans. 

(a) How many Republicans would you expect to not support the use of full-body scans? 

(b) How many Democrats would you expect to support the use of full-body scans? 

(c) How many Independents would you expect to not know or not answer? 

5.46 Exercise 5.31 presents the results of two surveys conducted in a rural and an urban 
district. In the rural district, 172 out of 250 college-bound high school students went out 
of state for college, and in the urban district, 450 out of 930 students did. 

(a) Create a two-way table presenting the results of these two surveys. 

(b) If in fact there was no difference between the true proportions of students from the 
rural and urban district who went out of state for college, how many students from 
each district would be expected to go out of state for college? 

5.47 Researchers studying the link between prenatal vitamin use and autism surveyed the 
mothers of a random sample of children aged 24 - 60 months with autism or with typical 
development. The table below shows the number of mothers in each group who did and 
did not use prenatal vitamins during the three months before pregnancy (periconceptional 
period). [34] 

                                        Autism 
                               Autism  Typical development  Total 
Periconceptional     No           111                   70    181 
prenatal vitamin     Yes          143                  159    302 
                     Total        254                  229    483 

(a) State appropriate hypotheses to test for independence of use of prenatal vitamins during 
the three months before pregnancy and autism. 

(b) Complete the hypothesis test and state an appropriate conclusion. (Reminder: verify 
any necessary assumptions for the test.) 

(c) A New York Times article reporting on this study was titled "Prenatal Vitamins May 
Ward Off Autism". Do you find the title of this article to be appropriate? If not, 
explain your reasoning, and suggest a more appropriate title. [35] 

5.48 A 2011 survey asked 806 randomly sampled adult Facebook users about their Face- 
book privacy settings. One of the questions on the survey was, "Do you know how to adjust 
your Facebook privacy settings to control what people can and cannot see?" The responses 
are cross-tabulated based on gender. [36] 

                   Gender 
Response      Male   Female   Total 
Yes            288      378     666 
No              61       62     123 
Not sure        10        7      17 
Total          359      447     806 



(a) State appropriate hypotheses to test for independence of gender and whether or not 
Facebook users know how to adjust their privacy settings. 

(b) Complete the hypothesis test and state an appropriate conclusion. (Reminder: verify 
any necessary assumptions for the test.) 

5.49 Exercise 5.38 provides a table summarizing the responses of a random sample of 
438 college graduates and 389 non-graduates on the topic of oil drilling. Complete a chi- 
square test for these data to check whether there is a statistically significant difference in 
responses from college graduates and non-graduates. 

5.50 A December 2010 survey asked 500 randomly sampled Los Angeles residents which 
shipping carrier they prefer to use for shipping holiday gifts. The table below shows the 
distribution of responses by age group as well as the expected counts for each cell (shown 
in parentheses). 



                                    Age 
Shipping Method     18-34        35-54        55+         Total 
USPS                72 (81)      97 (102)     76 (62)       245 
UPS                 52 (53)      76 (68)      34 (41)       162 
FedEx               31 (21)      24 (27)       9 (16)        64 
Something else       7 (5)        6 (7)        3 (4)         16 
Not sure             3 (5)        6 (5)        4 (3)         13 
Total              165          209          126            500 



(a) State the null and alternative hypotheses for testing for independence of age and pre- 
ferred shipping method for holiday gifts among Los Angeles residents. 

(b) Are the assumptions and conditions for inference satisfied? 

(c) Is a chi-squared test appropriate for testing for independence of age and preferred 
shipping method for holiday gifts? 



Chapter 6 

Small sample inference 



Large samples are sometimes unavailable, so it is useful to study methods that apply to 
small samples. Moving from large samples to small samples creates a number of problems 
that prevent us from applying the normal model directly, though the general ideas will be 
similar to those from Chapters 4 and 5. The approach is as follows: 

• Determine what test statistic is useful. 

• Identify the distribution of the test statistic under the condition the null hypothesis 
was true. 

• Apply the ideas of Chapter 4 under the new distribution. 

This is the same approach we used in Chapter 5. 

6.1 Small sample inference for the mean 

We applied a normal model to the sample mean in Chapter 4 when (1) the observations 
were independent, (2) the sample size was at least 50, and (3) the data were not strongly 
skewed. The findings in Section 4.4 also suggested we could relax condition (3) when we 
considered ever larger samples. 

In this section, we examine the distribution of the sample mean for any sample size. To 
this end, we must strengthen the condition about the distribution of the data. Specifically, 
our data must meet two criteria: 

(1) The observations are independent. 

(2) The population distribution is nearly normal. 

If we are not confident that the data come from a nearly normal distribution, then we 
cannot apply the methods of this section. Just like before, we can relax this condition 
when the sample size becomes larger. 

Let's review our motives for requiring a large sample (for this paragraph, we will 
assume the independence and skew conditions are met). First, a large sample would ensure 
that the sampling distribution of the mean was nearly normal. Second, it also gave us 
support that the estimate of the standard error was reliable. Both of these issues seemed 
to be satisfactorily addressed when the sample size was larger than 50. Now, we'll think 
about how these issues (the shape of the sampling distribution, and the accuracy of the 
standard error estimate) change under small samples. 






Figure 6.1: Comparison of a t distribution (solid line) and a normal distri- 
bution (dotted line). 

6.1.1 The normality condition 

If the individual observations are independent and come from a nearly normal population 
distribution, a special case of the Central Limit Theorem ensures the distribution of the 
sample means will be nearly normal. 



Central Limit Theorem for normal data 

The sampling distribution of the mean is nearly normal when the sample obser- 
vations are independent and come from a nearly normal distribution. This is true 
for any sample size. 
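
The boxed statement can be checked informally by simulation. The sketch below is only an illustration: the population mean, standard deviation, sample size, and number of repetitions are hypothetical values chosen for the demonstration, not quantities from the text. It draws many small samples from a normal population and compares the spread of their means with sigma / sqrt(n), and the share of means within 1.96 standard errors with the 95% a normal model would predict.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, mu, sigma, seed=1):
    """Means of `reps` independent samples of size n from a normal population."""
    rng = random.Random(seed)
    return [statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
            for _ in range(reps)]

# Hypothetical population and a deliberately small sample size.
mu, sigma, n = 10.0, 2.0, 5
means = simulate_sample_means(n, reps=20000, mu=mu, sigma=sigma)

# Theoretical standard deviation of the sample mean.
se = sigma / math.sqrt(n)

# Share of simulated means within 1.96 standard errors of the population mean.
within = sum(abs(m - mu) <= 1.96 * se for m in means) / len(means)

print("sd of sample means:", round(statistics.stdev(means), 3), "vs", round(se, 3))
print("share within 1.96 SE of mu:", round(within, 3))
```

Note that this check uses the true sigma. Replacing it with an estimate from each small sample is what creates the standard error problem taken up in the following sections.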



While this seems like a very helpful special case, there is one small problem. It is 
inherently difficult to verify normality in small data sets. 



Caution: Checking the normality condition 

We should exercise caution when verifying the normality condition for small sam- 
ples. It is important to not only examine the data but also think about where 
the data come from. For example, ask: would I expect this distribution to be 
symmetric, and am I confident that outliers are rare? 



6.1.2 Introducing the t distribution 

We will address the uncertainty of the standard error estimate by using a new distribution: 
the t distribution. A t distribution, shown as a solid line in Figure 6.1, has a bell shape. 
However, its tails are thicker than the normal model's. This means observations are more 
likely to fall beyond two standard deviations from the mean than under the normal distri- 
bution 1 . These extra thick tails are exactly the correction we need to resolve our problem 
with estimating the standard error. 

The t distribution, always centered at zero, has a single parameter: degrees of freedom. 
The degrees of freedom (df) describe the exact bell shape of the t distribution. Several 
t distributions are shown in Figure 6.2. When there are more degrees of freedom, the t 
distribution looks very much like the standard normal distribution. 

1 The standard deviation of the t distribution is actually a little more than 1. However, it is useful to 
always think of the t distribution as having a standard deviation of 1 in all of our applications. 

Figure 6.2: The larger the degrees of freedom, the more closely the t distri- 
bution resembles the standard normal model. 



Degrees of freedom 

The degrees of freedom describe the shape of the t distribution. The larger the 
degrees of freedom, the more closely the distribution approximates the normal 
model. 



When the degrees of freedom is about 50 or more, the t distribution is nearly indis- 
tinguishable from the normal distribution. In Section 6.1.4, we relate degrees of freedom 
to sample size. 
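
To see how the thick tails shrink as the degrees of freedom grow, one can compute the area more than 2 units from zero for several t distributions and compare it with the normal model. The sketch below is an illustration rather than the book's method: it integrates the t density numerically with Simpson's rule (an assumed implementation choice), and the function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with `df` degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df.
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_tail(x, df, steps=2000):
    """Area more than |x| from zero, using Simpson's rule on [0, |x|]."""
    x = abs(x)
    h = x / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_x = total * h / 3
    return 1 - 2 * area_0_to_x         # the distribution is symmetric about zero

# Two-tail area beyond 2 under the standard normal model.
normal_two_tail = math.erfc(2 / math.sqrt(2))

for df in (1, 2, 5, 18, 50, 500):
    print("df =", df, "area beyond 2:", round(t_two_tail(2, df), 4))
print("normal model:", round(normal_two_tail, 4))
```

The areas fall steadily toward the normal value as df increases, which is the pattern Figure 6.2 describes.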

6.1.3 Working with the t distribution 

We will find it very useful to become familiar with the t distribution, because it plays a 
very similar role to the normal distribution during inference. In place of the normal 
probability table, we use a t table, which is partially shown in Table 6.3. A larger table is 
presented in Appendix C.2 on page 364. 

Each row in the t table represents a t distribution with different degrees of freedom. 
The columns correspond to tail probabilities. For instance, if we know we are working with 
the t distribution with df = 18, we can examine row 18, which is highlighted in Table 6.3. 
If we want the value in this row that identifies the cutoff for an upper tail of 10%, we can 
look in the column where one tail is 0.100. This cutoff is 1.33. If we had wanted the cutoff 
for the lower 10%, we would use -1.33. Just like the normal distribution, all t distributions 
are symmetric. 

Example 6.1 What proportion of the t distribution with 18 degrees of freedom falls 
below -2.10? 

Just like a normal probability problem, we first draw the picture in Figure 6.4 and 
shade the area below -2.10. To find this area, we identify the appropriate row: df = 
18. Then we identify the column containing the absolute value of -2.10; it is the third 
column. Because we are looking for just one tail, we examine the top line of the table, 
which shows that a one tail area for a value in the third column corresponds to 0.025. 
About 2.5% of the distribution falls below -2.10. In the next example we encounter 
a case where the exact t value is not listed in the table. 
one tail     0.100   0.050   0.025   0.010   0.005 
two tails    0.200   0.100   0.050   0.020   0.010 
df    1       3.08    6.31   12.71   31.82   63.66 
      2       1.89    2.92    4.30    6.96    9.92 
      3       1.64    2.35    3.18    4.54    5.84 
    ... 
     17       1.33    1.74    2.11    2.57    2.90 
     18       1.33    1.73    2.10    2.55    2.88 
     19       1.33    1.73    2.09    2.54    2.86 
     20       1.33    1.72    2.09    2.53    2.85 
    ... 
    400       1.28    1.65    1.97    2.34    2.59 
    500       1.28    1.65    1.96    2.33    2.59 
      ∞       1.28    1.64    1.96    2.33    2.58 



Table 6.3: An abbreviated look at the t table. Each row represents a dif- 
ferent t distribution. The columns describe the tail areas at each standard 
deviation. The row with df = 18 has been highlighted. 







Figure 6.4: The t distribution with 18 degrees of freedom. The area below 
-2.10 has been shaded. 

Example 6.2 A t distribution with 20 degrees of freedom is shown in the left panel 
of Figure 6.5. Estimate the proportion of the distribution falling above 1.65. 

We identify the row in the t table using the degrees of freedom: df = 20. Then we 
look for 1.65; it is not listed. It falls between the first and second columns. Since 
these values bound 1.65, their tail areas will bound the tail area corresponding to 
1.65. We identify the one tail area of the first and second columns, 0.100 and 0.050, 
and we conclude that between 5% and 10% of the distribution is more than 1.65 
standard deviations above the mean. If we like, we can identify the precise area using 
statistical software: 0.0573. 
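
The "statistical software" step can be imitated with a short script. The sketch below is one possible approach, not the software the text refers to: it integrates the t density with Simpson's rule, and the function names are hypothetical. It is applied to the tail areas asked for in Examples 6.1 and 6.2.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with `df` degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(x, df, steps=2000):
    """P(T > x), using symmetry about zero and Simpson's rule on [0, |x|]."""
    a = abs(x)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    tail_beyond_a = 0.5 - total * h / 3
    return tail_beyond_a if x >= 0 else 1 - tail_beyond_a

# Example 6.2: area above 1.65 with 20 degrees of freedom.
area_above = t_upper_tail(1.65, 20)

# Example 6.1: area below -2.10 with 18 degrees of freedom.
area_below = 1 - t_upper_tail(-2.10, 18)

print(round(area_above, 4), round(area_below, 4))
```

Both results should fall inside the bounds read from the t table: between 0.05 and 0.10 for the first, and close to 0.025 for the second.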

Example 6.3 A t distribution with 2 degrees of freedom is shown in the right panel 
of Figure 6.5. Estimate the proportion of the distribution falling more than 3 units 
from the mean (above or below). 

As before, first identify the appropriate row: df = 2. Next, find the columns that 
capture 3; because 2.92 < 3 < 4.30, we use the second and third columns. Finally, 
we find bounds for the tail areas by looking at the two tail values: 0.05 and 0.10. We 
use the two tail values because we are looking for two (symmetric) tails. 
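
The table lookup used in Examples 6.2 and 6.3 can be written as a small routine. This is a sketch under stated assumptions: only three rows of Table 6.3 are copied in, and the function name and dictionary layout are hypothetical. Given a value and a row, it finds the two neighboring cutoffs and returns the tail areas that bound the answer.

```python
import bisect

# Column headers of Table 6.3.
ONE_TAIL = [0.100, 0.050, 0.025, 0.010, 0.005]
TWO_TAILS = [0.200, 0.100, 0.050, 0.020, 0.010]

# A few rows of Table 6.3, keyed by degrees of freedom.
T_TABLE = {
    2: [1.89, 2.92, 4.30, 6.96, 9.92],
    18: [1.33, 1.73, 2.10, 2.55, 2.88],
    20: [1.33, 1.72, 2.09, 2.53, 2.85],
}

def bracket_tail_area(value, df, two_tails=False):
    """Return (lower, upper) bounds on the tail area beyond |value|."""
    cutoffs = T_TABLE[df]
    areas = TWO_TAILS if two_tails else ONE_TAIL
    # Number of table cutoffs that are strictly smaller than |value|.
    i = bisect.bisect_left(cutoffs, abs(value))
    # A smaller cutoff leaves a larger tail, so the column to the left gives the upper bound.
    upper = areas[i - 1] if i > 0 else (1.0 if two_tails else 0.5)
    lower = areas[i] if i < len(areas) else 0.0
    return lower, upper

print(bracket_tail_area(1.65, 20))                # Example 6.2, one tail
print(bracket_tail_area(3, 2, two_tails=True))    # Example 6.3, two tails
```

For Example 6.3 the routine returns the same bounds as the table reading above: between 5% and 10% of the distribution lies more than 3 units from the mean.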

Figure 6.5: Left: The t distribution with 20 degrees of freedom, with the 
area above 1.65 shaded. Right: The t distribution with 2 degrees of free- 
dom, and the area further than 3 units from 0 has been shaded. 

Exercise 6.4 What proportion of the t distribution with 19 degrees of freedom 
falls above -1.79 units? Answer in the footnote 2 . 

6.1.4 The t distribution as a solution to the standard error problem 

When estimating the mean and standard error from a small sample, the t distribution is a 
more accurate tool than the normal model. 
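
A simulation gives a feel for why the t distribution is the more accurate tool. The sketch below is illustrative only: the sample size of 4, the standard normal population, and the number of repetitions are assumptions made for the demonstration. It builds intervals of the form (sample mean) ± multiplier × (estimated standard error) and records how often they capture the true mean, once with the normal cutoff 1.96 and once with 3.18, the df = 3 cutoff for two tails of 0.050 in Table 6.3.

```python
import math
import random

def coverage(n, multiplier, reps=20000, mu=0.0, sigma=1.0, seed=7):
    """Share of intervals xbar +/- multiplier * SE that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        # Sample standard deviation with the usual n - 1 divisor.
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        se = s / math.sqrt(n)          # estimated standard error of the mean
        if abs(xbar - mu) <= multiplier * se:
            hits += 1
    return hits / reps

cov_normal = coverage(n=4, multiplier=1.96)   # cutoff from the normal model
cov_t = coverage(n=4, multiplier=3.18)        # cutoff from the t table, df = 3

print("coverage with 1.96:", round(cov_normal, 3))
print("coverage with 3.18:", round(cov_t, 3))
```

With so few observations, the normal cutoff captures the mean noticeably less often than the nominal 95%, while the t cutoff comes close to it.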

TIP: When to use the t distribution 

When observations are independent and nearly normal, we can use the t distribution 
for inference of the sample mean. 



We use the t distribution instead of the normal model because we have extra uncer- 
tainty in the estimate of the standard error. To proceed with the t distribution for inference 
about a single mean, we must check two conditions. 

• Independence of observations: We verify this condition exactly as before. We either 
collect a simple random sample from less than 10% of the population or, if it was an 
experiment or random process, carefully ensure to the best of our abilities that the 
observations were independent. 

• Observations come from a nearly normal distribution: This second condition is more 
difficult to verify since we are usually working with small data sets. Instead we often 
(i) take a look at a plot of the data for obvious departures from the normal model 
and (ii) consider whether any previous experiences alert us that the data may not be 
nearly normal. 

When examining a sample mean and estimated standard error from a sample of n inde- 
pendent and nearly normal observations, we will use a t distribution with n - 1 degrees of 
freedom (df). For example, if the sample size was 19, then we would use the t distribution 
with df = 19 - 1 = 18 degrees of freedom and proceed exactly as we did in Chapter 4, 
except that now we use the t table. 

We can relax the normality condition for the observations when the sample size be- 
comes large. For instance, a slightly skewed data set might be acceptable if there were at 
least 15 observations. For a strongly skewed data set, we might require 30 or 40 observa- 
tions. For an extremely skewed data set, perhaps 100 or more. 

2 We want to find the shaded area above -1.79 (we leave the picture to you). The small left tail is between
0.025 and 0.05, so the larger upper region must have an area between 0.95 and 0.975.






CHAPTER 6. SMALL SAMPLE INFERENCE 



6.1.5 One sample confidence intervals with small n 

Dolphins are at the top of the oceanic food chain, which causes dangerous substances such 
as mercury to concentrate in their organs and muscles. This is an important problem for 
both dolphins and other animals, like humans, who occasionally eat them. For instance, 
this is particularly relevant in Japan where school meals have included dolphin at times. 




Figure 6.6: A Risso's dolphin. 

Photo by Mike Baird (http://www.bairdphotos.com/). 



Here we identify a confidence interval for the average mercury content in dolphin 
muscle using a sample of 19 Risso's dolphins from the Taiji area in Japan 3 . The data are 
summarized in Table 6.7. The minimum and maximum observed values can be used to 
evaluate whether or not there are any extreme outliers or obvious skew. 

n     x̄     s     minimum   maximum
19    4.4   2.3   1.7       9.2



Table 6.7: Summary of mercury content in the muscle of 19 Risso's dolphins
from the Taiji area. Measurements are in μg/wet g (micrograms of mercury
per wet gram of muscle).



Q Exercise 6.5 Are the independence and normality conditions satisfied for this data
set? Answer in the footnote 4 .



($t^*_{df}$: multiplication factor for the t confidence interval)



In the normal model, we used z* and the standard error to determine the width of a 
confidence interval. When we have a small sample, we try the t distribution instead of the 
normal model: 

\[ \bar{x} \pm t^*_{df} SE \]

The sample mean and estimated standard error are computed just as before ($\bar{x} = 4.4$ and
$SE = s/\sqrt{n} = 0.528$), while the value $t^*_{df}$ is a change from our previous formula. Here $t^*_{df}$
corresponds to the appropriate cutoff from the t distribution with df degrees of freedom,
which is identified below.

3 Taiji was featured in the movie The Cove, and it is a significant source of dolphin and whale meat in
Japan. Thousands of dolphins pass through the Taiji area annually, and we will assume these 19 dolphins
represent a random sample from those dolphins. Data reference: Endo T and Haraguchi K. 2009. High
mercury levels in hair samples from residents of Taiji, a Japanese whaling town. Marine Pollution Bulletin
60(5):743-747.

4 The observations are a random sample and consist of less than 10% of the population, therefore
independence is reasonable. The summary statistics in Table 6.7 do not suggest any strong skew or outliers,
which is encouraging. Based on this evidence - and that we don't have any clear reasons to believe the
data are not roughly normal - the normality assumption is reasonable.







Degrees of freedom for a single sample 

If our sample has n observations and we are examining a single mean, then we use
the t distribution with $df = n - 1$ degrees of freedom.





Applying the rule in our current example, we should use the t distribution with
$df = 19 - 1 = 18$ degrees of freedom. To build a 95% confidence interval, we will use the
abbreviated t table on page 244 where each tail has 2.5% (both tails total to 5%), which is
the third column. Then we identify the row with 18 degrees of freedom to obtain $t^*_{18} = 2.10$.
Generally the value of $t^*_{df}$ is slightly larger than what we would expect under the normal
model with $z^*$.

Finally, we can substitute all our values into the confidence interval equation to create 
the 95% confidence interval for the average mercury content in muscles from Risso's dolphins 
that pass through the Taiji area: 

\[ \bar{x} \pm t^*_{18} SE \quad\to\quad 4.4 \pm 2.10 \times 0.528 \quad\to\quad (3.29, 5.51) \]

We are 95% confident the average mercury content of muscles in Risso's dolphins is between
3.29 and 5.51 μg/wet gram. This is far above the US safety limit, which is 0.5 μg per wet
gram 5 .



Finding a t confidence interval for the mean 

Based on a sample of n independent and nearly normal observations, a confidence 
interval for the population mean is 

\[ \bar{x} \pm t^*_{df} SE \]

where $\bar{x}$ is the sample mean, $t^*_{df}$ corresponds to the confidence level and degrees of
freedom, and SE is the standard error as estimated by the sample.
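
The interval in the box can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `t_confidence_interval` is hypothetical, and the cutoff `t_star` is typed in by hand from the t table (2.10 for df = 18 at 95% confidence, as quoted above) rather than computed.

```python
import math

def t_confidence_interval(mean, sd, n, t_star):
    """Return (lower, upper) for mean +/- t_star * SE, where SE = sd / sqrt(n).

    t_star is the t table cutoff for df = n - 1 and the chosen confidence
    level; it is supplied by the caller, not looked up here.
    """
    se = sd / math.sqrt(n)
    margin = t_star * se
    return mean - margin, mean + margin

# Dolphin summary statistics from Table 6.7 (n = 19, so df = 18).
lower, upper = t_confidence_interval(mean=4.4, sd=2.3, n=19, t_star=2.10)
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

Passing the cutoff in as an argument keeps the sketch free of any statistics library; the same function applies to any one-sample t interval once the matching table value is supplied.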



Q Exercise 6.6 The FDA's webpage provides some data on mercury content of fish 6 .
Based on a sample of 15 croaker white fish (Pacific), a sample mean and standard
deviation were computed as 0.287 and 0.069 ppm (parts per million), respectively.
The 15 observations ranged from 0.18 to 0.41 ppm. We will assume these observations
are independent. Based on the summary statistics of the data, do you have any
objections to the normality condition of the individual observations? Answer in the
footnote 7 .



5 http://www.ban.org/ban-hg-wg/Mercury.ToxicTimeBomb.Final.PDF
6 http://www.fda.gov/Food/FoodSafety/Product-SpecificInformation/Seafood/FoodbornePathogensContaminants/Methylmercury/ucm115644.htm

7 There are no extreme outliers; all observations are within 2 standard deviations of the mean. If there 
is skew, it is not evident. There are no red flags for the normal model based on this (limited) information, 
and we do not have reason to believe the mercury content is not nearly normal in this type of fish. 
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Figure 6.8: Sample distribution of improvements in SAT scores after taking 
the SAT course. 



0 Example 6.7 Estimate the standard error of $\bar{x} = 0.287$ ppm from the statistics in
Exercise 6.6. If we are to use the t distribution to create a 90% confidence interval for
the actual mean of the mercury content, identify the degrees of freedom we should
use and also find $t^*_{df}$.

$SE = 0.069/\sqrt{15} = 0.0178$ and $df = n - 1 = 14$. Looking in the column where two tails
is 0.100 (since we want a 90% confidence interval) and row df = 14, we identify
$t^*_{14} = 1.76$.

Q Exercise 6.8 Based on the results of Exercise 6.6 and Example 6.7, compute a 90% 
confidence interval for the average mercury content of croaker white fish (Pacific). 
Answer in the footnote 8 . 

6.1.6 One sample t tests with small n 

An SAT preparation company claims that its students' scores improve by over 100 points 
on average after their course. A consumer group would like to evaluate this claim, and 
they collect data on a random sample of 30 students who took the class. Each of these 
students took the SAT before and after taking the company's course, and we would like 
to examine the differences in these scores to evaluate the company's claim. (This was 
originally paired data, so we use the differences; see Section 5.1 for more details on paired 
data.) The distribution of the difference in scores, shown in Figure 6.8, has mean 135.9 
and standard deviation 82.2. Do these data provide convincing evidence to back up the 
company's claim? 

Q Exercise 6.9 Set up hypotheses to evaluate the company's claim. Use $\mu$ to represent
the true average difference in student scores. Answer in the footnote 9 .

Q Exercise 6.10 Are the conditions to use the t distribution method satisfied? An- 
swer in the footnote 10 . 

8 Use $\bar{x} \pm t^*_{14} SE$: $0.287 \pm 1.76 \times 0.0178$. This corresponds to (0.256, 0.318). We are 90% confident that
the average mercury content of croaker white fish (Pacific) is between 0.256 and 0.318 ppm.

9 This is a one-sided test. $H_0$: student scores do not improve by more than 100 after taking the
company's course, $\mu \le 100$ (or simply $\mu = 100$). $H_A$: student scores improve by more than 100 points on
average after taking the company's course, $\mu > 100$.

10 This is a random sample from less than 10% of the company's students (assuming they have more 
than 300 former students), so the independence condition is reasonable. The normality condition also seems 
reasonable based on Figure 6.8. We can use the t distribution method.

Just as we did for the normal case, we standardize the sample mean using the Z score
to identify the test statistic. However, we will write T instead of Z, because we have a
small sample and are basing our inference on the t distribution. This statistic is called
the T score (it plays the same role as the Z score):

\[ T = \frac{\bar{x} - \text{null value}}{SE} = \frac{135.9 - 100}{82.2/\sqrt{30}} = 2.39 \]

If the null hypothesis was true, the test statistic T would follow a t distribution with
$df = n - 1 = 29$ degrees of freedom. We can draw a picture of this distribution and
mark the observed T, as in Figure 6.9. The shaded right tail represents the p-value: the 
probability of observing such strong evidence in favor of the SAT company's claim, if the 
average student improvement is really only 100. 
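
For readers who want to check the T score and the right-tail area numerically, here is a minimal Python sketch. It is not the book's method (the text uses the t table): the helper names `t_pdf` and `upper_tail` are hypothetical, and the tail area is approximated by integrating the t density with Simpson's rule, which is one of several reasonable ways to do it without a statistics library.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0.

    Uses Simpson's rule (steps must be even) on [0, t] and the fact that
    half of the symmetric t distribution lies above 0.
    """
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# SAT example: 30 paired differences with mean 135.9 and standard deviation 82.2.
n, mean_diff, sd_diff, null_value = 30, 135.9, 82.2, 100
se = sd_diff / math.sqrt(n)
T = (mean_diff - null_value) / se
df = n - 1
p_value = upper_tail(T, df)
print(f"T = {T:.2f}, df = {df}, one-sided p-value = {p_value:.4f}")
```

The table only brackets the tail area between two columns, whereas the numerical integral gives a single approximate value; the two should agree on which bracket the p-value falls in.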

Q Exercise 6.11 Use the t table in Appendix C.2 on page 364 to identify the p-value. 
What do you conclude? Answer in the footnote 11 . 

Q Exercise 6.12 Because we rejected the null hypothesis, does this mean that taking 
the company's class improves student scores by more than 100 points on average? 
Answer in the footnote 12 . 






6.2 The t distribution for the difference of two means 

It is useful to be able to compare two means. For instance, a teacher might like to test the 
notion that two versions of an exam were equally difficult. She could do so by randomly 
assigning each version to students. If she found that the average scores on the exams were 
so different that we cannot write it off as chance, then she may want to award extra points 
to students who took the more difficult exam. 

In a medical context, we might investigate whether embryonic stem cells (ESCs) can 
improve heart pumping capacity in individuals who have suffered a heart attack. We could 
look for evidence of greater heart health in the ESC group against a control group. 

11 We use the row with 29 degrees of freedom. The value T = 2.39 falls between the third and fourth 
columns. Because we are looking for a single tail, this corresponds to a p-value between 0.01 and 0.025. The 
p-value is guaranteed to be less than 0.05 (the default significance level), so we reject the null hypothesis. 
The data provide convincing evidence to support the company's claim that student scores improve by more 
than 100 points following the class. 

12 This is an observational study, so we cannot make this causal conclusion. For instance, maybe SAT 
test takers tend to improve their score over time even if they don't take a special SAT class, or perhaps 
only the most motivated students take such SAT courses.

The ability to make conclusions about a difference in two means, $\mu_1 - \mu_2$, is often
useful. If the sample sizes are small and the data are nearly normal, the t distribution
can be applied to the sample difference in means, $\bar{x}_1 - \bar{x}_2$, to make inference about the
difference in population means.

6.2.1 Sampling distributions for the difference in two means 

In the example of two exam versions, the teacher would like to evaluate whether there is 
convincing evidence that the difference in average scores is not due to chance. 

It will be useful to extend the t distribution method from Section 6.1 to apply to a
new point estimate:

\[ \bar{x}_1 - \bar{x}_2 \]

Just as we did in Section 5.2, we verify conditions for each sample separately and then
verify that the samples are also independent. For instance, if the teacher believes students
in her class are independent, the exam scores are nearly normal, and the students taking
each version of the exam were independent, then we can use the t distribution for the
sampling distribution of the point estimate, $\bar{x}_1 - \bar{x}_2$.

The formula for the standard error of $\bar{x}_1 - \bar{x}_2$, introduced in Section 5.2, remains useful
for small samples:

\[ SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \tag{6.13} \]

Because we will use the t distribution, we will need to identify the appropriate degrees of
freedom. This can be done using computer software. An alternative technique is to use
the smaller of $n_1 - 1$ and $n_2 - 1$, which is the method we will apply in the examples and
exercises 13 .







Using the t distribution for a difference in means 

The t distribution can be used for the (standardized) difference of two means if 
(1) each sample meets the conditions for the t distribution and (2) the samples are 
independent. We estimate the standard error of the difference of two means using 
Equation (6.13). 





6.2.2 Two sample t test 

Summary statistics for each exam version are shown in Table 6.10. The teacher would like 
to evaluate whether this difference is so large that it provides convincing evidence that 
Version B was more difficult (on average) than Version A. 

Q Exercise 6.14 Construct a two-sided hypothesis test to evaluate whether the observed
difference in sample means, $\bar{x}_A - \bar{x}_B = 5.3$, might be due to chance. Answer
in the footnote 14 .

13 This technique for degrees of freedom is conservative with respect to a Type 1 Error; it is more difficult 
to reject the null hypothesis using this df method. 

14 Because the professor did not expect one exam to be more difficult prior to examining the test results,
she should use a two-sided hypothesis test. $H_0$: the exams are equally difficult, on average, $\mu_A - \mu_B = 0$.
$H_A$: one exam was more difficult than the other, on average, $\mu_A - \mu_B \ne 0$.



Version   n     x̄      s     min   max
A         30    79.4   14    45    100
B         27    74.1   20    32    100

Table 6.10: Summary statistics of scores for each exam version. 




Figure 6.11: The t distribution with 26 degrees of freedom. The shaded 
right tail represents values with T > 1.15. Because it is a two-sided test, 
we also shade the corresponding lower tail. 



Q Exercise 6.15 To evaluate the hypotheses in Exercise 6.14 using the t distribution,
we must first verify assumptions. (a) Does it seem reasonable that the scores are
independent? (b) What about the normality condition for each group? (c) Do you
think each group would be independent? Answer in the footnote 15 .

After verifying the conditions for each sample and confirming the samples are inde- 
pendent of each other, we are ready to conduct the test using the t distribution. In this 
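case, we are estimating the true difference in average test scores using the sample data, so
the point estimate is $\bar{x}_A - \bar{x}_B = 5.3$. The standard error of the estimate can be calculated
using Equation (6.13):

\[ SE = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}} = \sqrt{\frac{14^2}{30} + \frac{20^2}{27}} = 4.62 \]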



Finally, we construct the test statistic: 

\[ T = \frac{\text{point estimate} - \text{null value}}{SE} = \frac{(79.4 - 74.1) - 0}{4.62} = 1.15 \]

If we have a computer handy, we can identify the degrees of freedom as 45.97. Otherwise
we use the smaller of $n_1 - 1$ and $n_2 - 1$: df = 26.
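
The arithmetic for the exam comparison can be reproduced with a short Python sketch. The function names are hypothetical. The text does not say which formula the software uses for the degrees of freedom; the sketch assumes the Welch-Satterthwaite approximation, which is a common choice and which does reproduce the 45.97 quoted above.

```python
import math

def two_sample_se(s1, n1, s2, n2):
    """Standard error of the difference of two sample means, Equation (6.13)."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite degrees of freedom (assumed software formula)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Summary statistics from Table 6.10.
n_a, mean_a, s_a = 30, 79.4, 14
n_b, mean_b, s_b = 27, 74.1, 20

se = two_sample_se(s_a, n_a, s_b, n_b)
T = ((mean_a - mean_b) - 0) / se          # the null value is 0
df_conservative = min(n_a - 1, n_b - 1)   # method used in the text
df_software = welch_df(s_a, n_a, s_b, n_b)

print(f"SE = {se:.2f}, T = {T:.2f}")
print(f"df (smaller of n - 1) = {df_conservative}, df (Welch) = {df_software:.2f}")
```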

Q Exercise 6.16 Identify the p-value, shown in Figure 6.11. Use df = 26. Answer in 
the footnote 16 . 

15 (a) It is probably reasonable to conclude the scores are independent. (b) The summary statistics
suggest the data are roughly symmetric about the mean, and it doesn't seem unreasonable to suggest the
data might be normal. (c) It seems reasonable to suppose that the samples are independent since the exams
were handed out randomly.

16 We examine row df = 26 in the t table. Because T = 1.15 is smaller than the value in the left column,
the p-value is at least 0.200 (two tails!). Because the p-value is so large, we do not reject the null hypothesis.
That is, the data do not convincingly show that one exam version is more difficult than the other, and the
teacher is not convinced that she should add points to the Version B exam scores.

          n    x̄       s
ESCs      9    3.50    5.17
control   9    -4.33   2.76

Table 6.12: Summary statistics for the embryonic stem cell study.



In Exercise 6.16, we could have used df = 45.97. However, this value is not listed in 
the table. In such cases, we use the next lower degrees of freedom (unless the computer 
also provides the p-value). For example, we could have used df = 45 but not df = 46.

Do embryonic stem cells (ESCs) help improve heart function following a heart attack? 
Table 6.12 contains summary statistics for an experiment to test ESCs in sheep that had 
a heart attack. Each of these sheep was randomly assigned to the ESC or control group, 
and the change in their hearts' pumping capacity was measured. A positive value generally 
corresponds to increased pumping capacity, which suggests a stronger recovery. We will 
consider this study in the exercises and examples below. 

Q Exercise 6.17 Set up hypotheses that will be used to test whether there is con- 
vincing evidence that ESCs actually increase the amount of blood the heart pumps. 
Answer in the footnote 17 .



Example 6.18 The raw data from the ESC experiment described in Exercise 6.17 
may be viewed in Figure 6.13. Using 8 degrees of freedom for the t distribution, 
evaluate the hypotheses. 



We first compute the point estimate of the difference along with the standard error:

\[ \bar{x}_{esc} - \bar{x}_{control} = 7.88 \]

\[ SE = \sqrt{\frac{5.17^2}{9} + \frac{2.76^2}{9}} = 1.95 \]

The p-value is depicted as the shaded right tail in Figure 6.14, and the test statistic
is computed as follows:

\[ T = \frac{7.88 - 0}{1.95} = 4.03 \]

We use the smaller of $n_1 - 1$ and $n_2 - 1$ (they are the same) for the degrees of
freedom: df = 8. Finally, we look for T = 4.03 in the t table; it falls to the right of
the last column, so the p-value is smaller than 0.005 (one tail!). Because the p-value
is less than 0.005 and therefore also smaller than 0.05, we reject the null hypothesis.
The data provide convincing evidence that embryonic stem cells improve the heart's 
pumping function in sheep that have suffered a heart attack. 



17 We first set up the hypotheses:

$H_0$: The stem cells do not improve heart pumping function. $\mu_{esc} - \mu_{control} = 0$.

$H_A$: The stem cells do improve heart pumping function. $\mu_{esc} - \mu_{control} > 0$.

Before we move on, we must first verify that the t distribution method can be applied. Because the 
sheep were randomly assigned their treatment and, presumably, were kept separate from one another, the 
independence assumption is verified for each sample as well as for between samples. The data are very 
limited, so we can only check for obvious outliers in the raw data in Figure 6.13. Since the distributions are 
(very) roughly symmetric, we will assume the normality condition is acceptable. Because the conditions 
are satisfied, we can apply the t distribution. 
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Figure 6.13: Histograms for both the embryonic stem cell group and the 
control group. Higher values are associated with greater improvement. 




Figure 6.14: Distribution of the sample difference of the mean improvements 
if the null hypothesis was true. The shaded area represents the p-value. 
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6.2.3 Two sample t confidence interval 

Based on the results of Example 6.18, we found significant evidence that ESCs actually
help improve the pumping function of the heart. But how large is this improvement? To
answer this question, we can use a confidence interval.

Q Exercise 6.19 In Example 6.18, we found that the point estimate, $\bar{x}_{esc} - \bar{x}_{control} = 7.88$,
has a standard error of 1.95. Using df = 8, create a 99% confidence interval for
the improvement due to ESCs. Answer in the footnote 18 .



6.2.4 Pooled standard deviation estimate (special topic) 

Occasionally, two populations will have standard deviations that are so similar that they
can be treated as identical. For example, historical data or a well-understood biological
mechanism may justify this strong assumption. In such cases, we can make our t distribution
approach slightly more precise by using a pooled standard deviation.

The pooled standard deviation of two groups is a way to use data from both
samples to better estimate the standard deviation and standard error. If $s_1$ and $s_2$ are
the standard deviations of groups 1 and 2 and there are good reasons to believe that the
population standard deviations are equal, then we can obtain an improved estimate of the
group variances by pooling their data:

\[ s_{pooled}^2 = \frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2} \]

where $n_1$ and $n_2$ are the sample sizes, as before. To utilize this new statistic, we substitute
$s_{pooled}^2$ in place of $s_1^2$ and $s_2^2$ in the standard error formula, and we use an updated formula
for the degrees of freedom:

\[ df = n_1 + n_2 - 2 \]

The benefits of pooling the standard deviation are realized through obtaining a better
estimate of the population standard deviations and using a larger degrees of freedom
parameter for the t distribution. Both of these changes may permit a better model of the
sampling distribution of $\bar{x}_1 - \bar{x}_2$.
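
As a rough illustration of the pooled calculation, the sketch below computes the pooled variance, the resulting standard error, and the degrees of freedom. The function names are hypothetical, and the input numbers are invented purely to show the arithmetic; they do not come from any data set in the text.

```python
import math

def pooled_variance(s1, n1, s2, n2):
    """Pooled variance: each sample variance weighted by its n - 1."""
    return (s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2)

def pooled_se_and_df(s1, n1, s2, n2):
    """Standard error of the difference in means using the pooled variance,
    together with the pooled degrees of freedom n1 + n2 - 2."""
    var_pooled = pooled_variance(s1, n1, s2, n2)
    se = math.sqrt(var_pooled / n1 + var_pooled / n2)
    return se, n1 + n2 - 2

# Hypothetical summary statistics (illustration only, not from the text).
se, df = pooled_se_and_df(s1=4.0, n1=12, s2=4.4, n2=15)
print(f"pooled SE = {se:.3f}, df = {df}")
```

With these made-up inputs the pooled degrees of freedom (25) exceed the smaller of $n_1 - 1$ and $n_2 - 1$ (11), which is the gain in df that the paragraph above describes.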



Caution: Pooling standard deviations should be done only after careful 
research 

A pooled standard deviation is only appropriate when background research indicates
the population standard deviations are nearly equal. When the sample size
is large and the condition may be adequately checked with data, the benefits of
pooling the standard deviations greatly diminish.



18 We know the point estimate, 7.88, and the standard error, 1.95. We also verified the conditions for
using the t distribution in Exercise 6.17. Thus, we only need identify $t^*_8$ to create a 99% confidence interval:
$t^*_8 = 3.36$. Thus, the 99% confidence interval for the improvement from ESCs is given by

\[ 7.88 \pm 3.36 \times 1.95 \quad\to\quad (1.33, 14.43) \]

That is, we are 99% confident that the true improvement in heart pumping function is somewhere between 
1.33% and 14.43%. 
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6.3 Small sample hypothesis testing for a proportion 
(special topic) 

In this section we develop inferential methods for a single proportion that are appropriate
when the sample size is too small to apply the normal model to $\hat{p}$. Just like the other small
sample techniques, the methods introduced here can be applied to large samples.

6.3.1 When the success-failure condition is not met 

People providing an organ for donation sometimes seek the help of a special "medical
consultant". These consultants assist the patient in all aspects of the surgery, with the goal
of reducing the possibility of complications during the medical procedure and recovery. 
Patients might choose a consultant based in part on the historical complication rate of 
the consultant's clients. One consultant tried to attract patients by noting the average 
complication rate for liver donor surgeries in the US is about 10%, but her clients have 
only had 3 complications in the 62 liver donor surgeries she has facilitated. She claims this 
is strong evidence that her work meaningfully contributes to reducing complications (and 
therefore she should be hired!). 

Q Exercise 6.20 We will let p represent the true complication rate for liver donors
working with this consultant. Estimate p using the data, and label this value $\hat{p}$.

0 Example 6.21 Is it possible to assess the consultant's claim using the data pro- 
vided? 

No. The claim is that there is a causal connection, but the data are observational. 
Patients who hire this medical consultant may have lower complication rates for other 
reasons. 

While it is not possible to assess this causal claim, it is still possible to test for an
association using these data. For this question we ask, could the low complication
rate of $\hat{p} = 3/62 = 0.048$ be due to chance?

Q Exercise 6.22 Write out hypotheses in both plain and statistical language to test 
for the association between the consultant's work and the true complication rate, p, 
for this consultant's clients. Answer in the footnote 19 . 

0 Example 6.23 In the examples based on large sample theory, we modeled $\hat{p}$ using
the normal distribution. Why is this not appropriate here?

The independence assumption may be reasonable if each of the surgeries is from a 
different surgical team. However, the success-failure condition is not satisfied. Under 
the null hypothesis, we would anticipate seeing 62 * 0.10 = 6.2 complications, not the 
10 required for the normal approximation. 

The uncertainty associated with the sample proportion should not be modeled using
the normal distribution. However, we would still like to assess the hypotheses from
Exercise 6.22 in the absence of the normal framework. To do so, we need to evaluate the possibility
of a sample value ($\hat{p}$) this far below the null value, $p_0 = 0.10$. This possibility is usually
measured with a p-value.

19 $H_0$: There is no association between the consultant's contributions and the clients' complication rate.
In statistical language, $p = 0.10$. $H_A$: Patients who work with the consultant tend to have a complication
rate lower than 10%, i.e. $p < 0.10$.

The p-value is computed based on the null distribution, which is the distribution of
the test statistic if the null hypothesis is true. Supposing the null hypothesis is true, we
can compute the p-value by identifying the chance of observing a test statistic that favors
the alternative hypothesis at least as strongly as the observed test statistic. In other words,
we compute the tail area (or areas) to identify the p-value.

6.3.2 Generating the null distribution and p-value by simulation 

We want to identify the sampling distribution of the test statistic ($\hat{p}$) if the null hypothesis
was true. In other words, we want to see how the sample proportion changes due to chance
alone. Then we plan to use this information to decide whether there is enough evidence to
reject the null hypothesis.

Under the null hypothesis, 10% of liver donors have complications during or after 
surgery. Suppose this rate was really no different for the consultant's clients. If this was 
the case, we could simulate 62 clients to get a sample proportion for the complication rate 
from the null distribution. 

Each client can be simulated using a deck of cards. Take one red card, nine black cards,
and mix them up. Then drawing a card is one way of simulating the chance a patient has
a complication if the true complication rate is 10%. If we do this 62 times (returning the
card and reshuffling before each draw) and compute the proportion of patients with
complications in the simulation, $\hat{p}_{sim}$, then this sample proportion is exactly a sample
from the null distribution.

An undergraduate student was paid $2 to complete this simulation. There were 5
simulated cases with a complication and 57 simulated cases without a complication, i.e.
$\hat{p}_{sim} = 5/62 = 0.081$.

0 Example 6.24 Is this one simulation enough to determine whether or not we should 
reject the null hypothesis from Exercise 6.22? Explain. 

No. To assess the hypotheses, we need to see a distribution of many $\hat{p}_{sim}$, not just a
single draw from this sampling distribution.

One simulation isn't enough to get a sense of the null distribution; many simulation 
studies are needed. Roughly 10,000 seems sufficient. However, paying someone to simulate 
10,000 studies by hand is a waste of time and money. Instead, simulations are typically 
programmed into a computer, which is fast and cheap. 
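
One way such a simulation might be programmed is sketched below in Python. The function name, the seed, and the use of the standard `random` module are choices made for this illustration, not details from the original study. Each simulated client has a complication with probability 0.10, and the estimated p-value is the fraction of simulated studies with 3 or fewer complications, matching the 3 observed among the consultant's 62 clients.

```python
import random

def simulate_complication_counts(n_clients, p_null, n_sims, seed=2024):
    """Simulate n_sims studies of n_clients each under the null hypothesis.

    Returns the number of complications in each simulated study. The seed
    is arbitrary and only makes the run repeatable.
    """
    rng = random.Random(seed)
    return [
        sum(rng.random() < p_null for _ in range(n_clients))
        for _ in range(n_sims)
    ]

n_clients, p_null, observed = 62, 0.10, 3
counts = simulate_complication_counts(n_clients, p_null, n_sims=10_000)

# Simulated sample proportions: draws from the (approximate) null distribution.
p_hat_sims = [c / n_clients for c in counts]

# Left tail: share of simulations at least as far below the null value as the
# observed result. Comparing integer counts avoids floating point ties.
left_tail = sum(c <= observed for c in counts) / len(counts)
print(f"estimated p-value: {left_tail:.4f}")
```

Because the result depends on the random draws, a different seed gives a slightly different estimate; with 10,000 simulations the variation is typically in the third decimal place.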

Figure 6.15 shows the results of 10,000 simulated studies. The proportions that are
equal to or less than $\hat{p} = 0.048$ are shaded. The shaded areas represent sample proportions
under the null distribution that provide at least as much evidence as $\hat{p}$ favoring the
alternative hypothesis. There were 1222 simulated sample proportions with $\hat{p}_{sim} \le 0.048$. We
use these to construct the null distribution's left-tail area and find the p-value:

\[ \text{left tail} = \frac{\text{Number of observed simulations with } \hat{p}_{sim} \le 0.048}{10000} \tag{6.25} \]

Of the 10,000 simulated $\hat{p}_{sim}$, 1222 were equal to or smaller than $\hat{p}$. Since the hypothesis
test is one-sided, the estimated p-value is equal to this tail area: 0.1222.

Q Exercise 6.26 Based on the estimated p-value of 0.1222, should the null hypothesis
be rejected? Give a brief explanation.

Figure 6.15: The null distribution for $\hat{p}$, created from 10,000 simulated
studies. The left tail, representing the p-value for the hypothesis test,
contains 12.22% of the simulations.



Q Exercise 6.27 Because the estimated p-value is 0.1222, which is larger than the 
significance level 0.05, we do not reject the null hypothesis. Explain what this means 
in plain language in the context of the problem. Answer in the footnote 20 . 

Q Exercise 6.28 Does the conclusion in Exercise 6.27 imply there is no real asso- 
ciation between the surgical consultant's work and fewer complications? Explain. 
Answer in the footnote 21 . 



One-sided hypothesis test for p with a small sample 

The p-value is always derived by analyzing the null distribution of the test statistic. 
The normal model poorly approximates the null distribution for $\hat{p}$ when the
success-failure condition is not satisfied. As a substitute, we generate the null distribution
using simulated sample proportions ($\hat{p}_{sim}$) and use this distribution to compute
the tail area, i.e. the p-value.



How do we compute the p-value when the test is two-sided? We continue to use the
same rule as before: double the single tail area, which remains a reasonable approach even
when the sampling distribution is asymmetric. However, this can result in p-values larger
than 1 when the point estimate is very near the mean in the null distribution; in such cases,
we write that the p-value is 1. Also, very large p-values computed in this way (e.g. 0.85)
may be slightly inflated.

Exercises 6.26 and 6.27 said the p-value is estimated. It is not exact because the 
simulated null distribution itself is not exact, only a close approximation. However, we can 
generate an exact null distribution and p-value using the binomial model from Section 3.4. 

20 There isn't sufficiently strong evidence to support an association between the consultant's work and 
fewer surgery complications. 

21 No. It might be that the consultant's work is associated with a reduction but that there isn't enough 
data to convincingly show this connection. 
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6.3.3 Generating the exact null distribution and p-value 

The number of successes in n independent cases can be described using the binomial model, 
which was introduced in Section 3.4. Recall that the probability of observing exactly k 
successes is given by 

\[ P(k \text{ successes}) = \binom{n}{k} p^k (1 - p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1 - p)^{n-k} \tag{6.29} \]

where p is the true probability of success. The expression $\binom{n}{k}$ is read as n choose k, and
the exclamation points represent factorials. For instance, 3! is equal to 3 * 2 * 1 = 6, 4! is
equal to 4 * 3 * 2 * 1 = 24, and so on (see Section 3.4).

The tail area of the null distribution is computed by adding up the probability in 
Equation (6.29) for each k that provides at least as strong of evidence favoring the al- 
ternative hypothesis as the data. If the hypothesis test is one-sided, then the p-value is 
represented by a single tail area. If the test is two-sided, compute the single tail area and 
double it to get the p-value, just as we have done in the past. 

0 Example 6.30 Compute the exact p-value to check the consultant's claim that her 
clients' complication rate is below 10%. 

Exactly k = 3 complications were observed in the n = 62 cases cited by the consultant. 
Since we are testing against the 10% national average, our null hypothesis is p = 0.10. 
We can compute the p-value by adding up the cases where there are 3 or fewer 
complications: 

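$$
\begin{aligned}
\text{p-value} &= \sum_{j=0}^{3} \binom{n}{j} p^j (1-p)^{n-j} \\
&= \sum_{j=0}^{3} \binom{62}{j} 0.1^j (1-0.1)^{62-j} \\
&= \binom{62}{0} 0.1^0 (1-0.1)^{62-0} + \binom{62}{1} 0.1^1 (1-0.1)^{62-1} \\
&\quad + \binom{62}{2} 0.1^2 (1-0.1)^{62-2} + \binom{62}{3} 0.1^3 (1-0.1)^{62-3} \\
&= 0.0015 + 0.0100 + 0.0340 + 0.0755 \\
&= 0.1210
\end{aligned}
$$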

This exact p-value is very close to the p-value based on the simulations (0.1222), and 
we come to the same conclusion. We do not reject the null hypothesis, and there is 
not statistically significant evidence to support the association. 

If it were plotted, the exact null distribution would look almost identical to the 
simulated null distribution shown in Figure 6.15 on page 257. 
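
The sum in Example 6.30 can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names `binomial_pmf` and `lower_tail_p_value` are illustrative choices. It evaluates Equation (6.29) for k = 0, 1, 2, 3 with n = 62 and p = 0.10 and adds the terms.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent cases (Equation 6.29)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def lower_tail_p_value(k_obs, n, p_null):
    """One-sided p-value: probability of k_obs or fewer successes under the null."""
    return sum(binomial_pmf(k, n, p_null) for k in range(k_obs + 1))


# Values from Example 6.30: 3 complications in 62 cases, null value p = 0.10.
terms = [binomial_pmf(k, 62, 0.10) for k in range(4)]
p_value = lower_tail_p_value(3, 62, 0.10)

print([round(t, 4) for t in terms])  # the four terms of the sum
print(round(p_value, 4))             # the exact one-sided p-value
```

For a two-sided test, the single tail area returned by `lower_tail_p_value` would be doubled, as described above.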



6.4 Hypothesis testing for two proportions 
(special topic) 

Cardiopulmonary resuscitation (CPR) is a procedure commonly used on individuals suf- 
fering a heart attack when other emergency resources are not available. This procedure is 
helpful in maintaining some blood circulation, but the chest compressions involved can also 
cause internal injuries. Internal bleeding and other injuries complicate additional treat- 
ment efforts following arrival at a hospital. For instance, blood thinners may be used to 
help release a clot that is causing the heart attack. However, the blood thinner would 
have negative repercussions on any internal injuries. Here we consider an experiment 22 
for patients who underwent CPR for a heart attack and were subsequently admitted to a 
hospital. These patients were randomly divided into a treatment group where they received 
a blood thinner or the control group where they did not receive the blood thinner. The 
outcome variable of interest was whether the patients survived for at least 24 hours. 

Example 6.31 Form hypotheses for this study in plain and statistical language. 
Let $p_c$ represent the true survival proportion in the control group and $p_t$ represent 
the survival proportion for the treatment group. 

We are interested in whether the blood thinners are helpful or hurtful, so this should 
be a two-sided test. 

$H_0$: Blood thinners do not have an overall effect on survival, i.e. the survival 
proportions are the same in each group, $p_t - p_c = 0$. 

$H_A$: Blood thinners do have an impact on survival, $p_t - p_c \neq 0$. 

6.4.1 Large sample framework for a difference in two proportions 

There were 50 patients in the experiment who did not receive the blood thinner and 40 
patients who did. The study results are shown in Table 6.16. 





             Survived   Died   Total
Control         11       39      50
Treatment       14       26      40
Total           25       65      90



Table 6.16: Results for the CPR study. Patients in the treatment group 
were given a blood thinner, and patients in the control group were not. 



Exercise 6.32 What is the sample survival proportion of the control group? Of the 
treatment group? Provide a point estimate of the difference in survival proportions 
of the two groups: $\hat{p}_t - \hat{p}_c$. Answer in the footnote 23. 

According to the point estimate, there is a 13% increase in the survival proportion 
when patients who have undergone CPR outside of the hospital are treated with blood 
thinners. However, we wonder if this difference could be due to chance. We'd like to 
investigate this using a large sample framework, but we first need to check the conditions 
for such an approach. 



22 Efficacy and safety of thrombolytic therapy after initially unsuccessful cardiopulmonary resuscitation: 
a prospective clinical trial, by Bottiger et al., The Lancet, 2001. 
23 $\hat{p}_t - \hat{p}_c = 0.35 - 0.22 = 0.13$ 
Example 6.33 Can the point estimate of the difference in survival proportions be 
adequately modeled using a normal distribution? 

We will assume the patients are independent, which is probably reasonable. The 
success-failure condition is also satisfied. There were at least 10 successes and 10 
failures in each group. 
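
The success-failure check in Example 6.33 is simple enough to express in a few lines. This is an illustrative Python sketch using the counts from Table 6.16; the helper name `success_failure_ok` and the dictionary layout are assumptions made for the example, and the threshold of 10 follows the convention used in this book.

```python
def success_failure_ok(successes, failures, threshold=10):
    """Return True when a group has at least `threshold` successes and failures."""
    return successes >= threshold and failures >= threshold


# Counts from Table 6.16, stored as (survived, died).
groups = {"control": (11, 39), "treatment": (14, 26)}
checks = {name: success_failure_ok(s, f) for name, (s, f) in groups.items()}

print(checks)
```

Both groups pass, but the control group does so with only 11 successes, which is why the text treats the normal model as an approximation to be double checked.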

While we can apply a normal framework as an approximation to find a p-value, we 
might keep in mind that there were just 11 successes in one group and 14 in the other. 
Below we conduct an analysis relying on the large sample normal theory. We will follow 
up with a small sample analysis and compare the results. 

Example 6.34 Assess the hypotheses presented in Example 6.31 using a large sample 
framework. Use a significance level of $\alpha = 0.05$. 



We suppose the null distribution of the sample difference follows a normal distribution 
with mean 0 (the null value) and a standard deviation equal to the standard error 
of the estimate. Because the null hypothesis in this case would be that the two 
proportions are the same, we compute the standard error using the pooled standard 
error formula from Equation (5.30) on page 211: 



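$$SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n_t} + \frac{\hat{p}(1-\hat{p})}{n_c}} \approx \sqrt{\frac{0.278(1-0.278)}{40} + \frac{0.278(1-0.278)}{50}} = 0.095$$

where we have used the pooled estimate $\left(\hat{p} = \frac{11+14}{50+40} = 0.278\right)$ in place of the true 
proportion, p. 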

The null distribution with mean zero and standard deviation 0.095 is shown in Fig- 
ure 6.17. We compute the tail areas to identify the p-value. To do so, we use the Z 
score of the point estimate: 

$$Z = \frac{(\hat{p}_t - \hat{p}_c) - \text{null value}}{SE} = \frac{0.13 - 0}{0.095} = 1.37$$

If we look this Z score up in Appendix C.1, we see that the right tail has area 
0.0853. The p-value is twice the single tail area: 0.171. Because this p-value is larger 
than 0.05, we do not reject the null hypothesis: there is insufficient evidence 
to conclude that the blood thinner helps or hurts. (Remember, we never 
"accept" the null hypothesis - we can only reject or fail to reject.) 
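
The large sample calculation in Example 6.34 can be reproduced with a short script. The following is a hedged Python sketch, not the authors' code; the variable names are illustrative, and `statistics.NormalDist` from the standard library stands in for the normal probability table.

```python
from math import sqrt
from statistics import NormalDist

# Counts from Table 6.16.
survived_t, n_t = 14, 40  # treatment group
survived_c, n_c = 11, 50  # control group

p_hat_t = survived_t / n_t
p_hat_c = survived_c / n_c
point_estimate = p_hat_t - p_hat_c

# Pooled proportion, used because the null hypothesis states p_t = p_c.
p_pooled = (survived_t + survived_c) / (n_t + n_c)
se = sqrt(p_pooled * (1 - p_pooled) / n_t + p_pooled * (1 - p_pooled) / n_c)

# Z score of the point estimate against the null value of 0.
z = (point_estimate - 0) / se

# Two-sided p-value: double the single tail area beyond |Z|.
upper_tail = 1 - NormalDist().cdf(abs(z))
p_value = 2 * upper_tail

print(round(se, 3), round(z, 2), round(p_value, 3))
```

Because the script does not round Z before finding the tail area, its p-value can differ in the third decimal place from one obtained with a printed table.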

The p-value computed in Example 6.34 relies on the normal approximation. We know that when 
the sample sizes are large, this approximation is quite good. However, when the sample 
sizes are relatively small - the success-failure condition is either not satisfied or is just 
barely satisfied - the approximation may only be adequate. Next we develop a small 
sample technique, apply it to these data, and compare our results. In general, the small 
sample method we develop may be used for any size sample, small or large, and should be 
considered more accurate than the corresponding large sample technique. 

6.4.2 Simulating a difference under the null distribution 

The ideas in this section were first introduced in the optional Section 1.8 on page 36. For 
the interested reader, this earlier section provides a more in-depth discussion. 
Figure 6.17: The null distribution of the point estimate, $\hat{p}_t - \hat{p}_c$, under the 
large sample framework is a normal distribution with mean 0 and standard 
deviation equal to the standard error, in this case SE = 0.095. The p-value 
is the probability, under the null distribution, of sample differences that provide 
at least as much evidence for the alternative hypothesis as the observed data. 
It is represented by the shaded areas. 

Suppose the null hypothesis is true. Then the blood thinner has no impact on survival 
and the 13% difference was due to chance. In this case, we can simulate null differences 
that are due to chance using a randomization technique 24. By randomly assigning "fake 
treatment" and "fake control" stickers to the patients' files, we could get a new grouping - 
one that is completely due to chance. The expected difference between the two proportions 
under this simulation is zero. 

We run this simulation by taking 40 treatmentFake and 50 controlFake labels and 
randomly assigning them to the patients. The label counts of 40 and 50 correspond to 
the number of treatment and control assignments in the actual study. We use a computer 
program to randomly assign these labels to the patients, and we organize the simulation 
results into Table 6.18. 





                 Survived   Died   Total
controlFake         15       35      50
treatmentFake       10       30      40
Total               25       65      90



Table 6.18: Simulated results for the CPR study under the null hypothesis. 
The labels were randomly assigned and are independent of the outcome of 
the patient. 
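
A single random relabeling like the one summarized in Table 6.18 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program used for the book: the seed is arbitrary, so the counts it produces will generally differ from those in Table 6.18, and the variable names are hypothetical.

```python
import random

# Outcomes from Table 6.16: 25 patients survived (1) and 65 died (0).
outcomes = [1] * 25 + [0] * 65

# Label counts match the actual study: 40 treatment and 50 control.
labels = ["treatmentFake"] * 40 + ["controlFake"] * 50

rng = random.Random(1)  # arbitrary seed, used only for reproducibility
rng.shuffle(labels)     # labels are assigned independently of the outcomes

# Tally a 2x2 table: each label maps to [survived, died].
table = {"controlFake": [0, 0], "treatmentFake": [0, 0]}
for label, outcome in zip(labels, outcomes):
    table[label][0 if outcome == 1 else 1] += 1

for label, (survived, died) in table.items():
    print(f"{label:14s} survived={survived:2d} died={died:2d} total={survived + died}")

# Difference in survival proportions for this one simulated grouping.
sim_diff = table["treatmentFake"][0] / 40 - table["controlFake"][0] / 50
print(round(sim_diff, 3))
```

Each run with a different seed gives one draw from the null distribution of the difference.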



Exercise 6.35 What is the difference in death rates between the two fake groups 
in Table 6.18? How does this compare to the observed 13% in the real groups? 

The difference computed in Exercise 6.35 represents a draw from the null distribution 
of the sample differences. Next we generate many more simulations to build up the null 
distribution, much like we did in Section 6.3.2 to build a null distribution for one sample 
proportion. 



24 The test procedure we employ in this section is formally called a permutation test. 
Figure 6.19: An approximation of the null distribution of the point estimate, 
$\hat{p}_t - \hat{p}_c$. The p-value is twice the right tail area. 



Caution: Simulation in the two proportion case requires that the null 
difference is zero 

The technique described here to simulate a difference from the null distribution 
relies on an important condition in the null hypothesis: there is no connection 
between the two variables considered. In some special cases, the null difference 
might not be zero, and more advanced methods (or a large sample approximation, 
if appropriate) would be necessary. 



6.4.3 Null distribution for the difference in two proportions 

We build up an approximation to the null distribution by repeatedly creating tables like 
the one shown in Table 6.18 and computing the sample differences. The null distribution 
from 10,000 simulations is shown in Figure 6.19. 
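
Repeating the relabeling many times gives an approximate null distribution and p-value. The sketch below is one possible Python implementation, not the authors' program; the function names and the seed are assumptions. It shuffles the outcomes and treats the first 40 as the fake treatment group, which is equivalent to shuffling the labels.

```python
import random

N_TREATMENT, N_CONTROL, N_SURVIVED = 40, 50, 25


def diff_in_proportions(treatment_survivors):
    """Survival proportion in treatment minus control, given treatment survivors."""
    control_survivors = N_SURVIVED - treatment_survivors
    return treatment_survivors / N_TREATMENT - control_survivors / N_CONTROL


def simulate_null_differences(n_sims, seed=0):
    """Draw n_sims differences under the null by random relabeling."""
    rng = random.Random(seed)
    outcomes = [1] * N_SURVIVED + [0] * (N_TREATMENT + N_CONTROL - N_SURVIVED)
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(outcomes)
        # The first 40 shuffled outcomes play the role of the fake treatment group.
        diffs.append(diff_in_proportions(sum(outcomes[:N_TREATMENT])))
    return diffs


observed_diff = diff_in_proportions(14)  # 14 treatment survivors in Table 6.16
null_diffs = simulate_null_differences(10_000)

# Right tail: simulated differences at least as large as the observed one.
right_tail = sum(d >= observed_diff for d in null_diffs) / len(null_diffs)
p_value = min(1.0, 2 * right_tail)  # two-sided, capped at 1

print(round(right_tail, 3), round(p_value, 3))
```

The estimated tail area changes slightly with the seed and the number of simulations, which is why the simulated p-value is described as an estimate.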

Example 6.36 Compare Figures 6.17 and 6.19. How are they similar? How are 
they different? 

The shapes are similar, but the simulated results show that the continuous approxi- 
mation of the normal distribution is not very good. We might wonder, how close are 
the p-values? 

Exercise 6.37 The right tail area is about 0.13. (It is only a coincidence that we 
also have $\hat{p}_t - \hat{p}_c = 0.13$.) Since this is a two-sided test, what is the p-value? 

Exercise 6.38 The p-value is computed by doubling the right tail area: 0.26. How 
does this value compare with the large sample approximation for the p-value? Answer 
in the footnote 25. 

In general, small sample methods produce more accurate results since they rely on 
fewer assumptions. However, they often require some extra work or simulations. For this 
reason, many statisticians use small sample methods only when conditions for large sample 
methods are not satisfied. 



25 The approximation in this case is fairly poor (p-values: 0.171 vs. 0.26), though we come to the same 
conclusion. The data do not provide convincing evidence showing the blood thinner hurts or helps patients. 
6.5 Exercises 



6.5.1 Small sample inference for the mean 

6.1 An independent random sample is selected from an approximately normal population 
with unknown standard deviation. The sample is small (n < 50). Find the degrees of 
freedom and the critical t value (t*) for the given confidence level. 

(a) n = 42, CL = 90% (c) n = 29, CL = 95% 

(b) n = 21, CL = 98% (d) n = 12, CL = 99% 



6.2 A 90% confidence interval for a population mean is (65,77). The population distri- 
bution is approximately normal and the population standard deviation is unknown. Given 
that this confidence interval is calculated based on an independent random sample of size 
25, calculate the sample mean, the margin of error, and the sample standard deviation. 

6.3 For a given confidence level, $t^*_{df}$ is larger than $z^*$. Explain how $t^*_{df}$ being slightly larger 
than $z^*$ affects the width of the confidence interval. 

6.4 An independent random sample is selected from an approximately normal population 
with unknown standard deviation. The sample is small (n < 50). Find the p-value for 
the given set of hypotheses and t values. Also determine if the null hypothesis would be 
rejected at $\alpha = 0.05$. 

(a) $H_A: \mu > \mu_0$, n = 11, T = 1.91 (c) $H_A: \mu \neq \mu_0$, n = 38, T = 0.83 

(b) $H_A: \mu < \mu_0$, n = 17, T = -3.45 (d) $H_A: \mu > \mu_0$, n = 47, T = 2.13 

6.5 New York is known as "the city that never sleeps". A random sample of 25 New 
Yorkers were asked how much sleep they get per night. Statistical summaries of these data 
are shown below. Do these data provide strong evidence that New Yorkers sleep less than 
8 hours a night on average? 



n     $\bar{x}$    s      min    max
25    7.73   0.77   6.17   9.78



(a) Write the hypotheses in symbols and in words. 

(b) Calculate the test statistic, T. (Reminder: check conditions and assumptions.) 

(c) Find and interpret the p-value in context. Drawing a picture may be helpful. 

(d) What is the conclusion of the hypothesis test? 

(e) If you were to construct a confidence interval that corresponded to this hypothesis test, 
would you expect 8 hours to be in the interval? 






CHAPTER 6. SMALL SAMPLE INFERENCE 






6.6 A college newspaper article claims that students at this college spend more than an 
hour per day on average on social networking sites. The article is based on a survey con- 
ducted at this college on a random sample of 45 college students who use social networking 
sites, which yielded a sample mean of 68.2 minutes with a standard deviation of 21 min- 
utes. A histogram of the data is shown below. Do these data provide strong evidence that 
students at this college who use social networking sites spend on average more than an 
hour (60 minutes) per day on such sites? 

(a) Write the hypotheses in symbols and in words. 

(b) Calculate the test statistic, T. (Reminder: check conditions and assumptions.) 

(c) Find and interpret the p-value in context. Drawing a picture may be helpful. 

(d) What is the conclusion of the hypothesis test? 

(e) If you were to construct a confidence interval that corresponded to this hypothesis test, 
would you expect 60 minutes to be in the interval? 

[Histogram: Social network use (in min)] 

6.7 Exercise 6.5 provides summary statistics on the number of hours of sleep 25 randomly 
sampled New Yorkers get per night. 

(a) Calculate a 90% confidence interval for the average number of hours New Yorkers sleep 
per night and interpret this interval in context. 

(b) Using your confidence interval, would you reject the notion that New Yorkers sleep an 
average of 8 hours per night? 

6.8 Exercise 6.6 provides the mean (68.2 minutes) and standard deviation (21 minutes) 
for the time spent on social networking sites of 45 randomly sampled students (under the 
condition that they use social networking sites) at an unnamed college. 

(a) Calculate a 90% confidence interval for the true average amount of time students who 
use social networking sites at this college spend on social networking sites per day. 

(b) Using your confidence interval, would you reject the notion that students at this college 
who use social networking sites spend an average of 60 minutes on these sites? 

6.9 Chain restaurants in California are required to display calorie counts of each menu 
item. Prior to October 2008, when a law went into effect that required calorie information 
on menus, the average calorie intake of a diner at a particular restaurant was 1900 calories. 
Suppose a nutritionist randomly samples 30 diners at this restaurant and finds an average 
calorie intake of 1806 calories with a standard deviation of 310 calories. 

(a) Do these data provide strong evidence that the average calorie intake has changed 
after calorie counts started to be displayed on the menus at this restaurant? You may 
assume that the distribution of the data is nearly normal. 

(b) Calculate a 95% confidence interval for the average calorie intake of diners at this 
restaurant. 



(c) Does the conclusion of your hypothesis test agree with the confidence interval you 
calculated? 
6.10 Fueleconomy.gov, the official U.S. government source for fuel economy information, 
allows users to share gas mileage information on their vehicles. The histogram below shows 
the distribution of gas mileage (in miles per gallon, MPG) data from 25 users who drive a 
2009 Toyota Prius. The sample mean is 50.3 MPG and the standard deviation is 6.8 MPG. 
Note that these data are user estimates and since the source data cannot be verified, the 
accuracy of these estimates is not guaranteed. [37] 



[Histogram: Mileage (in MPG)] 

(a) We would like to use this data to evaluate the average mileage of all 2009 Prius drivers. 
Do you think this is reasonable? Why or why not? 

(b) The EPA claims that a 2009 Prius gets 46 MPG. Do these data provide strong evi- 
dence against this estimate for drivers who participate on fueleconomy.gov? Use a 10% 
significance level. 

(c) Calculate a 90% confidence interval for the average gas mileage of a 2009 Prius by 
drivers who participate on fueleconomy.gov. 

(d) Does the conclusion of your hypothesis test agree with the confidence interval you 
calculated? 

6.11 You are given the following hypotheses: 

$H_0: \mu = 60$ 
$H_A: \mu < 60$ 

We know that the sample standard deviation is 8 and the sample size is 20. For what sample 
mean would the p-value be equal to 0.05? Assume that all assumptions and conditions 
necessary for inference are satisfied. 

6.12 A 95% confidence interval for a population mean, $\mu$, is given as (18.985, 21.015). 
This confidence interval is based on a simple random sample of 36 observations. Calculate 
the sample mean, $\bar{x}$, and the sample standard deviation, s. Assume that all assumptions 
and conditions necessary for inference are satisfied. 
6.5.2 The t distribution for the difference of two means 

6.13 Average income varies from one region of the country to another, and it often reflects 
both lifestyles and regional living expenses. Suppose a new graduate is considering a job 
in two locations, Cleveland, OH and Sacramento, CA, and she wants to see whether the 
average income in one of these cities is higher than the other. She would like to conduct a t 
test based on two small samples from the 2000 Census, but she first must consider whether 
the conditions are met to implement the test. Below are histograms for each city. Should 
she move forward with the t test? Explain. 



[Histograms: Total personal income for Cleveland, OH and Sacramento, CA] 

        Cleveland, OH   Sacramento, CA
Mean    $ 26,436        $ 32,182
SD      $ 33,239        $ 40,480
n       21              17

6.14 The first Oscar awards for best actor and best actress were given out in 1929. The 
histograms below show the age distribution for all the best actor and best actress winners 
from 1929 to 2011. Summary statistics for these distributions are also provided. Is a t test 
appropriate for testing whether the difference in the sample means for age might be due to 
chance? Explain. 



[Histograms: Age (in years) for Best Actress and Best Actor winners] 

        Best Actress   Best Actor
Mean    35.6           44.7
SD      11.3           8.9
6.15 Two independent random samples are selected from normal populations with unknown 
standard deviations. Both samples are small (n < 50). Find the p-value for the 
given set of hypotheses and t values. Also determine if the null hypothesis would be rejected 
at $\alpha = 0.05$. Remember that a reasonable choice of degrees of freedom for the two-sample 
case is the minimum of $n_1 - 1$ and $n_2 - 1$. The "exact" df is something we cannot compute 
from the given information. 



(a) $H_A: \mu_1 > \mu_2$, $n_1 = 23$, $n_2 = 25$, T = 3.16 

(b) $H_A: \mu_1 \neq \mu_2$, $n_1 = 38$, $n_2 = 37$, T = 2.72 

(c) $H_A: \mu_1 < \mu_2$, $n_1 = 45$, $n_2 = 41$, T = -1.83 

(d) $H_A: \mu_1 \neq \mu_2$, $n_1 = 11$, $n_2 = 15$, T = 0.28 



6.16 Two independent random samples are selected from normal populations with unknown 
standard deviations. Both samples are small (n < 50). Find the degrees of freedom 
and the critical t value ($t^*_{df}$) for the given confidence level. 

(a) $n_1 = 16$, $n_2 = 16$, CL = 90% (c) $n_1 = 8$, $n_2 = 10$, CL = 99% 

(b) $n_1 = 36$, $n_2 = 41$, CL = 95% (d) $n_1 = 23$, $n_2 = 27$, CL = 98% 

6.17 A weight loss pill claims to accelerate weight loss when accompanied with exercise 
and diet. Diet researchers from a consumer advocacy group decided to test this claim using 
an experiment. 42 subjects were randomly assigned to two groups: 21 took the pill and 21 
only received a placebo. Both groups underwent the same diet and exercise regimen. In 
the group that received the pill the average weight loss was 20 lbs with a standard deviation 
of 4 lbs. In the placebo group the average weight loss was 18 lbs with a standard deviation 
of 5 lbs. 

(a) Calculate a 95% confidence interval for the difference between the two means and 
interpret it in context. 

(b) Based on your confidence interval, is there significant evidence that the weight loss pill 
is effective? 

(c) Does this prove that the weight loss pill is effective? 

6.18 A company has two factories in which they manufacture engines. Once a month they 
randomly select 10 engines from each factory and test if there is a difference in performance 
in engines made in the two factories. This month the average output of the motors from 
Factory 1 is 120 horsepower with a standard deviation of 5 horsepower, and the average 
output of the motors from Factory 2 is 132 horsepower with a standard deviation of 4 
horsepower. 

(a) Calculate a 95% confidence interval for the difference in the average horsepower for 
engines coming from the two factories and interpret it in context. 

(b) Based on your confidence interval, is there significant evidence that there is a difference 
in performance in engines made in the two factories? If so, can you tell which 
factory produces motors with lower performance? Explain. 

(c) Recently upgrades were made in Factory 2. Do these data prove that these upgrades 
enhanced the performance in engines made in this factory? Explain. 
6.19 Chicken farming is a multi-billion dollar industry, and any methods that increase the 
growth rate of young chicks can reduce consumer costs while increasing company profits, 
possibly by millions of dollars. An experiment was conducted to measure and compare the 
effectiveness of various feed supplements on the growth rate of chickens. Newly hatched 
chicks were randomly allocated into six groups, and each group was given a different feed 
supplement. Their weights in grams after six weeks are given along with feed types in the 
data set called chickwts. Below are some summary statistics from this data set along with 
box plots showing the distribution of weights by feed type. [38] 
[Table: summary statistics by feed type (including SD and n). Box plots: weight (in grams) by feed type.] 

(a) Describe the distributions of weights of chickens that were fed linseed and horsebean. 

(b) Do these data provide strong evidence that the average weights of chickens that were 
fed linseed and horsebean are different? Use a 5% significance level. 

(c) What type of error might we have committed? Explain. 

(d) Would your conclusion change if we used $\alpha = 0.01$? 



6.20 Casein is a common weight gain supplement for humans. Does it have the same effect 
on chickens? Using data provided in Exercise 6.19, test the hypothesis that the average 
weight of chickens that were fed casein is different than the average weight of chickens that 
were fed soybean. Assume that conditions for inference are satisfied. 

6.21 Each year the US Environmental Protection Agency (EPA) releases fuel economy 
data on cars manufactured in that year. Below are summary statistics on fuel efficiency 
(in miles/gallon) from random samples of cars with manual and automatic transmissions 
manufactured in 2010. Do these data provide strong evidence of a difference between the 
average fuel efficiency of cars with manual and automatic transmissions in terms of their 
average city mileage? Assume that conditions for inference are satisfied. [39] 



              City MPG         Hwy MPG
              Mean     SD      Mean     SD      n
Manual        21.08    4.29    29.31    4.63    26
Automatic     15.62    2.76    21.38    3.73    26
6.22 An organization is studying whether women have caught up to men in starting pay 
after attending college. They randomly sampled 28 women, who earned an average of 
$38,293.78 out of college with a standard deviation of $5,170.22. Twenty-four men were 
also randomly sampled, and their earnings averaged $41,981.82 with a standard deviation 
of $3,195.42. Using a significance level of 0.02, do these data provide strong evidence that 
women have not yet caught up to men in terms of pay? If so, can we make a causal con- 
clusion? If so, explain why. If not, provide an example of why the causal interpretation 
would not be valid. 



6.23 Exercise 6.21 provides data on fuel efficiency of cars manufactured in 2010. Use 
these statistics to calculate a 95% confidence interval for the difference between average 
highway mileage of manual and automatic cars and interpret this interval in context. 



6.24 Exercise 6.22 provides summary statistics on the starting salaries of men and women 
who recently graduated from a college. Based on this information, calculate a 98% con- 
fidence interval for the difference between the average starting salaries of such men and 
women and interpret this interval in context. 



6.25 A group of researchers hypothesize that the presence of distracting stimuli during 
eating increases the amount of food consumed by a person and could thereby contribute 
to overeating and obesity. To test their hypothesis, the researchers monitored food intake 
for a group of 44 patients who were randomized into two equal groups. The treatment 
group ate lunch while playing solitaire, and the control group ate lunch without any added 
distractions. Patients in the control group ate 57.1 grams of biscuits, with a standard 
deviation of 45.1 grams, and patients in the treatment group ate 27.1 grams of biscuits, 
with a standard deviation of 26.4 grams. Do these data provide strong evidence that 
the amount of biscuits consumed by the patients in the treatment and control groups are 
different? Assume that assumptions and conditions for inference are satisfied. [40] 



6.26 The researchers from Exercise 6.25 also investigated the effects of being distracted 
by a game on how much people eat. The 22 patients in the treatment group who ate their 
lunch while playing solitaire were asked to do a serial-order recall of the food lunch items 
they ate. The average number of items recalled by the patients in this group was 4.9, with 
a standard deviation of 1.8. The average number of items recalled by the patients in the 
control group (no distraction) was 6.1, with a standard deviation of 1.8. Do these data 
provide strong evidence that the average number of food items recalled by the patients in 
the treatment and control groups are different? 
6.5.3 Small sample hypothesis testing for a proportion 

6.27 A popular uprising that started on January 25, 2011 in Egypt led to the 2011 
Egyptian Revolution. Polls show that about 69% of American adults followed the news 
about the political crisis and demonstrations in Egypt closely during the first couple weeks 
following the start of the uprising. Among a random sample of 30 high school students, it 
was found that only 17 of them followed the news about Egypt closely during this time. 
[41] 



(a) Write the hypotheses for testing if the proportion of high school students who followed 
the news about Egypt is different than the proportion of American adults who did. 

(b) Calculate the proportion of high schoolers in this sample who followed the news about 
Egypt closely during this time. 

(c) For large sample theory, we modeled $\hat{p}$ using the normal distribution. Why should we 
be cautious about this approach for these data? 

(d) Since the normal approximation may not be as reliable here as a small sample ap- 
proach, we evaluate the hypotheses using a simulation. Describe how to perform such 
a simulation and, once you had results, how to estimate the p-value. 

(e) Below is a histogram showing the distribution of $\hat{p}_{sim}$ in 10,000 simulations under the 
null hypothesis. Estimate the p-value using the plot and determine the conclusion of 
the hypothesis test. 
6.28 Assisted Reproductive Technology (ART) is a collection of techniques that help 
facilitate pregnancy (e.g. in vitro fertilization). A 2008 report by the Centers for Disease 
Control and Prevention estimated that ART has been successful in leading to a live birth 
in 31% of cases [42]. A new infertility clinic claims that their success rate is higher than 
average. A random sample of 30 of their patients yielded a success rate of 40%. 



(a) Write the hypotheses to test if the success rate for ART at this clinic is significantly 
higher than the average success rate. 



(b) For large sample theory, we modeled $\hat{p}$ using the normal distribution. Why is this not 
appropriate here? 



(c) The normal approximation would be less reliable here, so we use a simulation strategy. 
Describe a setup for a simulation that would be appropriate in this situation and how 
the p-value can be calculated using the simulation results. 



(d) Below is a histogram showing the distribution of $\hat{p}_{sim}$ in 10,000 simulations under 
the null hypothesis. Estimate the p-value using the plot and use it to evaluate the 
hypotheses. 
6.5.4 Hypothesis testing for two proportions 

6.29 A "social experiment" conducted by a TV program questioned what people do when 
they see a very obviously bruised woman getting picked on by her boyfriend. On two 
different occasions at the same restaurant the same couple was depicted; however, in one 
scenario the woman was dressed "provocatively" and in the other scenario the woman was 
dressed "conservatively". The table below shows how many restaurant diners were present 
under each scenario, and whether or not they intervened. 



                        Scenario
                 Provocative   Conservative   Total
Intervene  Yes        5             15          20
           No        15             10          25
           Total     20             25          45



A simulation was conducted to test if people react differently under the two scenarios. In 
order to conduct the simulation, a researcher wrote yes on 20 index cards and no on 25 
index cards to indicate whether or not a diner (represented by each card) intervened. Then 
he shuffled the cards and dealt them into two groups of size 20 and 25, the provocative 
and conservative scenarios, respectively. He counted how many diners in each scenario 
intervened, and calculated the difference between the simulated proportions of intervention as 
$\hat{p}_{pr,sim} - \hat{p}_{con,sim}$. This simulation was repeated 10,000 times using software to obtain 10,000 
differences that are due to chance alone. The histogram below shows the distribution of 
the simulated differences. 
(a) What are the hypotheses? 

(b) Calculate the observed difference between the rates of intervention under the two sce- 
narios. 

(c) Estimate the p-value using the figure above and determine the conclusion of the hy- 
pothesis test. 
6.30 An experiment conducted by the MythBusters, a science entertainment TV program 
on the Discovery Channel, tested if a person can be subconsciously influenced into yawning 
if another person near them yawns. 50 people were randomly assigned to two groups: 34 
to a group where a person near them yawned (treatment) and 16 to a group where there 
wasn't a yawn seed (control). The following table shows the results of this experiment. [43] 



                        Group
                 Treatment   Control   Total
Result  Yawn         10          4       14
        Not Yawn     24         12       36
        Total        34         16       50



A simulation was conducted to understand the distribution of the test statistic under the 
assumption of independence: having someone yawn near another person has no influence 
on if the other person will yawn. In order to conduct the simulation, a researcher wrote 
yawn on 14 index cards and not yawn on 36 index cards to indicate whether or not a person 
yawned. Then he shuffled the cards and dealt them into two groups of size 34 and 16 for 
treatment and control, respectively. He counted how many participants in each simulated
group yawned in an apparent response to a nearby yawning person and calculated the difference
between the simulated proportions of yawning, $\hat{p}_{trtmt,sim} - \hat{p}_{ctrl,sim}$. This simulation was
repeated 10,000 times using software to obtain 10,000 differences that are due to chance 
alone. The histogram below shows the distribution of the simulated differences. 
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(a) What are the hypotheses? 

(b) Calculate the observed difference between the yawning rates under the two scenarios. 

(c) Estimate the p-value using the figure above and determine the conclusion of the hy- 
pothesis test. 
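
The card-shuffling procedure described in Exercise 6.30 can be sketched in software. The block below is a minimal illustration rather than the code used to draw the book's histogram: the function name `simulate_differences`, the random seed, and the choice of a one-sided tail are all assumptions made for the sketch.

```python
import numpy as np

def simulate_differences(n_success, n_failure, n_group1, n_sims=10_000, seed=0):
    """Shuffle outcome 'cards' and return the simulated differences in proportions.

    Hypothetical helper: each simulation deals the shuffled cards into two
    groups and records (proportion in group 1) - (proportion in group 2).
    """
    rng = np.random.default_rng(seed)
    cards = np.array([1] * n_success + [0] * n_failure)  # 1 = yawn, 0 = not yawn
    diffs = np.empty(n_sims)
    for i in range(n_sims):
        shuffled = rng.permutation(cards)
        group1 = shuffled[:n_group1]   # cards dealt to the treatment group
        group2 = shuffled[n_group1:]   # remaining cards go to the control group
        diffs[i] = group1.mean() - group2.mean()
    return diffs

# Exercise 6.30: 14 "yawn" cards, 36 "not yawn" cards, groups of size 34 and 16
diffs = simulate_differences(n_success=14, n_failure=36, n_group1=34)
observed = 10 / 34 - 4 / 16

# Share of shuffles with a difference at least as large as the observed one
# (one-sided tail assumed; the small tolerance guards against rounding).
p_value = np.mean(diffs >= observed - 1e-12)
print(f"observed difference = {observed:.4f}")
print(f"estimated p-value   = {p_value:.3f}")
```

The same function applies to Exercise 6.29 by passing 20 "yes" cards, 25 "no" cards, and a first group of size 20.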



Chapter 7

Introduction to linear regression



Linear regression is a very powerful statistical technique. Many people have some familiarity 
with regression just from reading the news, where graphs with straight lines are overlaid 
on scatterplots. Linear models can be used for prediction or to evaluate whether there is a 
linear relationship between two numerical variables. 

Figure 7.1 shows two variables whose relationship can be modeled perfectly with a 
straight line. The equation for the line is 

y = 7 + 49.24x 

Imagine what a perfect linear relationship would mean: you would know the exact value 
of y, just by knowing the value of x. This is unrealistic in almost any natural process. 
Consider height and weight of school children for example. Their height, x, gives you some 
information about their weight, y, but there is still a lot of variability, even for children of 
the same height. 

We often write a linear regression line as 

$y = \beta_0 + \beta_1 x$

where $\beta_0$ and $\beta_1$ represent two parameters that we wish to identify. Usually x represents
an explanatory or predictor variable and y represents a response. We use the variable x
to predict a response y. Usually, we use $b_0$ and $b_1$ to denote the point estimates of $\beta_0$ and
$\beta_1$.

Examples of several scatterplots are shown in Figure 7.2. While none reflect a perfect 
linear relationship, it will be useful to fit approximate linear relationships to each. The 
lines represent models relating x to y. The first plot shows a relatively strong downward 
linear trend. The second shows an upward trend that, while evident, is not as strong as 
the first. The last plot shows a very weak downward trend in the data, so slight we can 
hardly notice it. 

We will soon find that there are cases where fitting a straight line to the data, even if 
there is a clear relationship between the variables, is not helpful. One such case is shown in 
Figure 7.3 where there is a very strong relationship between the variables even if the trend 
is not linear. Such nonlinear trends are beyond the scope of this textbook. 
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Figure 7.1: Twelve requests were put into a trading company to buy Target 
Corporation stock (ticker TGT, May 24th, 2011), and the total cost of the 
shares were reported. Because the cost is computed using a linear formula, 
the linear fit is perfect. 
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Figure 7.2: Three data sets where a linear model may be useful but is not 
perfect. 
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Figure 7.3: A linear model is not useful in this nonlinear case. (These data 
are from an introductory physics experiment.) 
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Figure 7.4: A scatterplot showing headL against totalL. The first possum 
with a head length of 94.1mm and a length of 89cm is highlighted. 



7.1 Line fitting, residuals, and correlation 

It is helpful to think deeply about the line fitting process. In this section, we examine 
criteria for identifying a linear model and introduce a new statistic, correlation. 



7.1.1 Beginning with straight lines 

Scatterplots were introduced in Chapter 1 as a graphical technique to present two numerical 
variables simultaneously. Such plots permit the relationship between the variables to be 
examined with ease. Figure 7.4 shows a scatterplot for the headL and totalL variables 
from the possum data set introduced in Chapter 1. Each point represents a single possum 
from the data. 

The headL and totalL variables are associated. Possums with an above average total 
length also tend to have above average head lengths. While the relationship is not perfectly 
linear, it could be helpful to partially explain the connection between these variables with 
a straight line. 

Straight lines should only be used when the data appear to have a linear relationship, 
such as the case shown in the left panel of Figure 7.5. The right panel of Figure 7.5 shows 
a case where a curved band would be more useful in capturing a different set of data. 



Caution: Watch out for curved trends 

We only consider models based on straight lines in this chapter. If data show a 
nonlinear trend, like that in the right panel of Figure 7.5, more advanced techniques 
should be used. 
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Figure 7.5: Most observations on the left can be captured in a straight band. 
On the right, we have the weight and mpgCity variables represented from 
the cars data set, and a curved band does a better job of capturing these 
cases than a straight band. 



7.1.2 Fitting a line by eye 

We want to describe the relationship between the headL and totalL variables using a line. 
In this example, we will use the total length as the predictor variable, x, to predict a 
possum's head length, y. We could fit the linear relationship by eye, as in Figure 7.6. The 
equation for this line is 

$\hat{y} = 41 + 0.59 \times x$ (7.1)

We can use this model to discuss properties of possums. For instance, our model predicts
a possum with a total length of 80 cm will have a head length of

$\hat{y} = 41 + 0.59 \times 80 = 88.2$ mm

A "hat" on y is used to signify that this is an estimate. The model predicts that possums 
with a total length of 80 cm will have an average head length of 88.2 mm. Without further 
information about an 80 cm possum, this prediction for head length that uses the average 
is a reasonable estimate. Generally, linear models predict the average value of y for a 
particular value of x based on the model. 

7.1.3 Residuals 

Residuals can be thought of as the leftovers from the model fit: 

Data = Fit + Residual 

Each observation will have a residual. If an observation is above the regression line, then 
its residual, the vertical distance from the observation to the line, is positive. Observations 
below the line have negative residuals. One goal in picking the right linear model is for 
these residuals to be as small as possible. 
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Figure 7.6: A reasonable linear model was fit to represent the relationship 
between headL and totalL. 

Three observations are noted specially in Figure 7.6. The "x" has a small, negative 
residual of about -1; the observation marked by "+" has a large residual of about +7; and 
the observation marked by "A" has a moderate residual of about -4. The size of residuals 
is usually discussed in terms of its absolute value. For example, the residual for "A" is 
larger than that of "x" because |-4| is larger than |-1|.



Residual: difference between observed and expected 

The residual of an observation $(x_i, y_i)$ is the difference of the observed response
($y_i$) and the response we would predict based on the model fit ($\hat{y}_i$):

$e_i = y_i - \hat{y}_i$

We typically identify $\hat{y}_i$ by plugging $x_i$ into the model. The residual of the i-th
observation is denoted by $e_i$.



0 Example 7.2 The linear fit shown in Figure 7.6 is given in Equation (7.1). Based 
on this line, formally compute the residual of the observation (77.0,85.3). This ob- 
servation is denoted by "x" on the plot. Check it against the earlier visual estimate, 
-1. 

We first compute the predicted value of point "x" based on the model: 

$\hat{y}_{\times} = 41 + 0.59 \times x_{\times} = 41 + 0.59 \times 77.0 = 86.43$

Next we compute the difference of the actual head length and the predicted head
length:

$e_{\times} = y_{\times} - \hat{y}_{\times} = 85.3 - 86.43 = -1.13$
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Figure 7.7: Residual plot for the model in Figure 7.6. 



This is very close to the visual estimate of -1. 
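
The prediction and residual calculations above can be reproduced in a few lines. This is a small sketch using the by-eye line from Equation (7.1); the function names `predict_head_length` and `residual` are hypothetical and chosen only for illustration.

```python
def predict_head_length(total_length_cm):
    """Predicted head length (mm) from total length (cm), using Equation (7.1)."""
    return 41 + 0.59 * total_length_cm

def residual(total_length_cm, head_length_mm):
    """Residual = observed response minus the response predicted by the model."""
    return head_length_mm - predict_head_length(total_length_cm)

# Prediction for a possum with a total length of 80 cm
print(round(predict_head_length(80), 1))

# Residual for the observation (77.0, 85.3), marked "x" in Figure 7.6
print(round(residual(77.0, 85.3), 2))
```

A negative result means the observation falls below the line; a positive result means it falls above.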

Q Exercise 7.3 If a model underestimates an observation, will the residual be positive 
or negative? What about if it overestimates the observation? Answer in the footnote 1 . 

Q Exercise 7.4 Compute the residuals for the observations (85.0,98.6) ("+" in the 
figure) and (95.5, 94.0) ("A") using the linear model given in Equation (7.1). Answer 
for "+" is in the footnote 2 . 

Residuals are helpful in evaluating how well a linear model fits a data set. We often 
display them in a residual plot such as the one shown in Figure 7.7 for the regression line 
in Figure 7.6. The residuals are plotted at their original horizontal locations but with the 
vertical coordinate as the residual. For instance, the point (85.0, 98.6), marked "+", had a residual of
7.45, so in the residual plot it is placed at (85.0,7.45). Creating a residual plot is sort of 
like tipping the scatterplot over so the regression line is horizontal. 

0 Example 7.5 One purpose of residual plots is to identify characteristics or patterns 
still apparent in data after fitting a model. Figure 7.8 shows three scatterplots with 
linear models in the first row and residual plots in the second row. Can you identify 
any patterns remaining in the residuals? 

In the first data set (first column), the residuals show no obvious patterns. The 
residuals appear to be scattered randomly about 0, represented by the dashed line. 

1 If a model underestimates an observation, then the model estimate is below the actual. The residual
(the actual minus the model estimate) must then be positive. The opposite is true when the model
overestimates the observation: the residual is negative.

2 First compute the predicted value based on the model:

$\hat{y}_{+} = 41 + 0.59 \times x_{+} = 41 + 0.59 \times 85.0 = 91.15$

Then the residual is given by

$e_{+} = y_{+} - \hat{y}_{+} = 98.6 - 91.15 = 7.45$

This was close to the earlier estimate of 7.
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Figure 7.8: Sample data with their best fitting lines (top row) and their 
corresponding residual plots (bottom row). 



The second data set shows a pattern in the residuals. There is some curvature in the 
scatterplot, which is more obvious in the residual plot. We should not use a straight 
line to model these data and should use a more advanced technique instead. 

The last plot shows very little upwards trend, and the residuals also show no obvious 
patterns. It is reasonable to try to fit this linear model to the data. However, it is 
unclear whether there is statistically significant evidence that the slope parameter is 
different from zero. The point estimate, $b_1$, is not zero, but we wonder if this could
just be due to chance. We will address this sort of question in Section 7.4. 



7.1.4 Describing linear relationships with correlation 






Correlation: strength of a linear relationship 

The correlation describes the strength of the linear relationship between two 
variables and takes values between -1 and 1. We denote the correlation by R. 



We compute the correlation using a formula, just as we did with the sample mean 
and standard deviation. However, this formula is rather complex 3 , so we generally perform 
the calculations on a computer or calculator. Figure 7.9 shows eight plots and their cor- 
responding correlations. Only when the relationship is perfectly linear is the correlation 
either -1 or 1. If the relationship is strong and positive, the correlation will be near +1. 
If it is strong and negative, it will be near -1. If there is no apparent linear relationship 
between the variables, then the correlation will be near zero. 

The correlation is intended to quantify the strength of a linear trend. Nonlinear trends, 
even when strong, sometimes produce correlations that do not reflect the strength of the 
relationship; see three such examples in Figure 7.10. 
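
As an illustration of the correlation formula in footnote 3 (standardize each variable, multiply the standardized values pairwise, sum, and divide by n - 1), here is a minimal sketch. The data are synthetic and the function name `correlation` is an assumption for the example; it is checked against NumPy's built-in `np.corrcoef`.

```python
import numpy as np

def correlation(x, y):
    """Sample correlation: average product of standardized values (n - 1 divisor)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)   # standardized x, using the sample sd
    zy = (y - y.mean()) / y.std(ddof=1)   # standardized y, using the sample sd
    return np.sum(zx * zy) / (n - 1)

x = np.linspace(-3, 3, 25)                # synthetic values, symmetric about 0

r_up = correlation(x, 2 * x + 1)          # perfectly linear, upward
r_down = correlation(x, -0.5 * x)         # perfectly linear, downward
r_curve = correlation(x, x ** 2)          # strong relationship, but not linear

print(r_up, r_down, r_curve)
print(np.corrcoef(x, x ** 2)[0, 1])       # built-in check for the curved case
```

The parabola is a deliberately extreme case: y is completely determined by x, yet the correlation is essentially zero because the relationship has no linear component.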



3 Formally, we can compute the correlation for observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ using the
formula

$R = \frac{1}{n-1} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i - \bar{y}}{s_y}$

where $\bar{x}$, $\bar{y}$, $s_x$, and $s_y$ are the sample means and standard deviations for each variable.
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Figure 7.9: Sample scatterplots and their correlations. The first row shows 
variables with a positive relationship, represented by the trend up and to 
the right. The second row shows variables with a negative trend, where a 
large value in one variable is associated with a low value in the other. 
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Figure 7.10: Sample scatterplots and their correlations. In each case, there 
is a strong relationship between the variables. However, the correlation is 
not very strong, and the relationship is not linear. 
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Figure 7.11: SAT math (percentile) and GPA scores for students after one 
year in college. Two lines are fit to the data, the solid line being the least 
squares line. 



Q Exercise 7.6 While no straight line will fit any of the data sets in any of the 
scatterplots shown in Figure 7.10, try drawing nonlinear curves to each plot. Once 
you create a curve for each, describe what is important in your fit. Comment in the 
footnote 4 . 



7.2 Fitting a line by least squares regression 

Fitting linear models by eye is (rightfully) open to criticism since it is based on an individual 
preference. In this section, we propose least squares regression as a more rigorous approach. 

This section will use data on SAT math scores and first year GPA from a random 
sample of students at a college 5 . A scatterplot of the data is shown in Figure 7.11 along
with two linear fits. The lines follow a positive trend in the data; students who scored 
higher math SAT scores also tended to have higher GPAs after their first year in school. 

Q Exercise 7.7 Is the correlation positive or negative? Answer in the footnote 6 . 

7.2.1 An objective measure for finding the best line 

We begin by thinking about what we mean by "best". Mathematically, we want a line 
that has small residuals. Perhaps our criterion could minimize the sum of the residual 
magnitudes: 

$|e_1| + |e_2| + \cdots + |e_n|$ (7.8)

4 Possible explanation: The line should be close to most points and reflect overall trends in the data. 

5 These data were collected by Educational Testing Service from an unnamed college. More information: 
https://www.dartmouth.edu/~chance/course/Syllabi/Princeton96/Class12.html

6 Positive: because larger SAT scores are associated with higher GPAs, the correlation will be positive. 
Using a computer, the correlation can be computed: 0.387. 
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We could use a computer program to find a line that minimizes this criterion (the sum). 
This does result in a pretty good fit, which is shown as the dashed line in Figure 7.11. 
However, a more common practice is to choose the line that minimizes the sum of the 
squared residuals:

$e_1^2 + e_2^2 + \cdots + e_n^2$ (7.9)

The line that minimizes this least squares criterion is represented as the solid line in 
Figure 7.11. This is commonly called the least squares line. Three possible reasons to 
choose Criterion (7.9) over Criterion (7.8) are the following: 

1. It is the most commonly used method. 

2. Computing the line based on Criterion (7.9) is much easier by hand and in most 
statistical software. 

3. In many applications, a residual twice as large as another is more than twice as bad. 
For example, being off by 4 is usually more than twice as bad as being off by 2. 
Squaring the residuals accounts for this discrepancy. 

The first two reasons are largely for tradition and convenience, and the last reason explains 
why Criterion (7.9) is typically most helpful 7 .
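
To make the two criteria concrete, the sketch below evaluates the sum of absolute residuals and the sum of squared residuals for the least squares line and for a few nearby lines. The data are synthetic (not from the text), the helper names `sum_abs` and `sum_sq` are hypothetical, and `np.polyfit` is used here simply as one convenient way to obtain the least squares fit.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)   # synthetic data for illustration only

def sum_abs(b0, b1):
    """Sum of the residual magnitudes for the line y = b0 + b1 * x."""
    return np.sum(np.abs(y - (b0 + b1 * x)))

def sum_sq(b0, b1):
    """Sum of the squared residuals for the line y = b0 + b1 * x."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

# np.polyfit returns coefficients from the highest degree down: slope, then intercept
b1_ls, b0_ls = np.polyfit(x, y, 1)
print("least squares line:", sum_abs(b0_ls, b1_ls), sum_sq(b0_ls, b1_ls))

# Any other line has a larger sum of squared residuals
for db0, db1 in [(0.3, 0.0), (0.0, 0.05), (-0.2, 0.03)]:
    print("shifted line:      ", sum_abs(b0_ls + db0, b1_ls + db1),
          sum_sq(b0_ls + db0, b1_ls + db1))

# A residual of 4 is twice a residual of 2 in magnitude, but four times as costly when squared
print(abs(4) / abs(2), 4 ** 2 / 2 ** 2)    # 2.0 4.0
```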

7.2.2 Conditions for the least squares line 

When fitting a least squares line, we generally require 

• Linearity. The data should show a linear trend. If there is a nonlinear trend (e.g. 
left panel of Figure 7.12), an advanced regression method from another book or later 
course should be applied. 

• Nearly normal residuals. Generally the residuals must be nearly normal. When 
this condition is found to be unreasonable, it is usually because of outliers or concerns 
about influential points, which we will discuss in greater depth in Section 7.3. An 
example of non-normal residuals is shown in the center panel of Figure 7.12. 

• Constant variability. The variability of points around the least squares line remains 
roughly constant. An example of non-constant variability is shown in the right panel 
of Figure 7.12. 

Q Exercise 7.10 Should you have any concerns about applying least squares to the 
satMath and GPA data in Figure 7.11? Answer in the footnote 8 .

7.2.3 Finding the least squares line 

For the satMath and GPA data, we could write the equation as 



GPA = $\beta_0 + \beta_1 \times$ satMath (7.11)

7 There are applications where Criterion (7.8) may be more useful, and there are plenty of other criteria
we might consider. However, this book only applies the least squares criterion.

8 The trend appears to be linear, the data fall around the line with no obvious outliers, and the variance
is roughly constant. Least squares regression can be applied to these data.
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Figure 7.12: Three examples showing when the methods in this chapter are 
insufficient to apply to the data. In the left panel, a straight line does not 
fit the data. In the second panel, there are outliers; two points on the left 
are relatively distant from the rest of the data and one of these points is 
very far away from the line. In the third panel, the variability of the data 
around the line increases with larger values of x. 



Here the equation is set up to predict GPA based on a student's satMath score, which would 
be useful to a college admissions office. These two values, $\beta_0$ and $\beta_1$, are the parameters of
the regression line. 

Just like in Chapters 4-6, the parameters are estimated using observed data. In 
practice, this estimation is done using a computer in the same way that other estimates, 
like a sample mean, can be estimated using a computer or calculator. However, we can also 
find the parameter estimates by applying two properties of the least squares line: 

• If $\bar{x}$ is the mean of the horizontal variable (from the data) and $\bar{y}$ is the mean of the
vertical variable, then the point $(\bar{x}, \bar{y})$ is on the least squares line.

• The slope of the least squares line is estimated by

$b_1 = \frac{s_y}{s_x} R$ (7.12)

where R is the correlation between the two variables, and $s_x$ and $s_y$ are the sample
standard deviations of the explanatory variable (variable on the horizontal axis) and
response (variable on the vertical axis), respectively.

We use $b_1$ to represent the point estimate of the parameter $\beta_1$, and we will similarly use $b_0$
to represent the sample point estimate for $\beta_0$.

Q Exercise 7.13 Table 7.13 shows the sample means for the satMath and GPA variables:
54.395 and 2.468. Plot the point (54.395, 2.468) on Figure 7.11 to
verify it falls on the least squares line (the solid line).



Q Exercise 7.14 Using the summary statistics in Table 7.13, compute the slope for 
the regression line of GPA against SAT math percentiles. Answer in the footnote 9 . 



9 Apply Equation (7.12) with the summary statistics from Table 7.13 to compute the slope:

$b_1 = \frac{s_y}{s_x} R = \frac{0.741}{8.450} \times 0.387 = 0.03394$
               satMath ("x")          GPA ("y")
mean           $\bar{x}$ = 54.395     $\bar{y}$ = 2.468
sd             $s_x$ = 8.450          $s_y$ = 0.741
                        correlation: R = 0.387



Table 7.13: Summary statistics for the satMath and GPA data. 

You might recall from math class the point-slope form of a line (another common 
form is slope-intercept). Given the slope of a line and a point on the line, $(x_0, y_0)$, the
equation for the line can be written as

$y - y_0 = \text{slope} \times (x - x_0)$ (7.15)

A common exercise to become more familiar with foundations of least squares regression 
is to use basic summary statistics and point-slope form to produce the least squares line. 



TIP: Identifying the least squares line from summary statistics 

To identify the least squares line from summary statistics: 

• Estimate the slope parameter, $b_1$, using Equation (7.12)

• Noting that the point $(\bar{x}, \bar{y})$ is on the least squares line, use $x_0 = \bar{x}$ and $y_0 = \bar{y}$
along with the slope $b_1$ in the point-slope equation:

$y - \bar{y} = b_1 (x - \bar{x})$

• Simplify the equation. 



0 Example 7.16 Using the point (54.395, 2.468) from the sample means and the slope
estimate $b_1 = 0.034$ from Exercise 7.14, find the least-squares line for predicting GPA
based on satMath.

Apply the point-slope equation using (54.395, 2.468) and the slope, $b_1 = 0.03394$:

$y - y_0 = b_1 (x - x_0)$
$y - 2.468 = 0.03394 (x - 54.395)$

Expanding the right side and then adding 2.468 to each side, the equation simplifies:

GPA = 0.622 + 0.03394 * satMath

Here we have replaced y with GPA and x with satMath to put the equation in 
context. This form matches the form of Equation (7.11). 
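
The steps in the tip box can also be carried out in code. This is a minimal sketch, assuming only the summary statistics from Table 7.13; the function name `least_squares_from_summary` is hypothetical.

```python
def least_squares_from_summary(x_mean, y_mean, s_x, s_y, r):
    """Return (b0, b1) for the least squares line from summary statistics.

    The slope comes from Equation (7.12); the intercept follows from the
    point-slope form using the fact that (x_mean, y_mean) lies on the line.
    """
    b1 = (s_y / s_x) * r
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Summary statistics for satMath (x) and GPA (y) from Table 7.13
b0, b1 = least_squares_from_summary(
    x_mean=54.395, y_mean=2.468, s_x=8.450, s_y=0.741, r=0.387
)
print(f"GPA = {b0:.3f} + {b1:.5f} * satMath")
```

Rearranging the point-slope equation gives the intercept directly as the mean of y minus the slope times the mean of x, which is the expression used for `b0` above.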

We mentioned earlier that a computer is usually used to compute the least squares line. 
A summary table based on some computer output is shown in Table 7.14 for the satMath 
and GPA data. The first column of numbers provides estimates for $b_0$ and $b_1$, respectively.
Compare these to the result from Example 7.16. 
               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)      0.6219       0.1408      4.42     0.0000
satMath          0.0339       0.0026     13.26     0.0000



Table 7.14: Summary of least squares fit for the SAT/GPA data. Compare 
the parameter estimates in the first column to the results of Example 7.16. 

0 Example 7.17 Examine the second, third, and fourth columns in Table 7.14. Can 
you guess what they represent? 

These columns help determine whether the estimates are significantly different from 
zero and to create a confidence interval for each parameter. The second column lists 
the standard errors of the estimates, the third column contains t test statistics, and 
the fourth column lists p- values (2-sided test). We will describe the interpretation of 
these columns in greater detail in Section 7.4. 

7.2.4 Interpreting regression line parameter estimates 

Interpreting parameters in a regression model is often one of the most important steps in 
the analysis. 

0 Example 7.18 The intercept and slope estimates for the SAT-GPA data are 0.6219
and 0.0339. What do these numbers really mean?

The intercept $b_0 = 0.6219$ describes the average GPA if a student was at the zeroth
percentile score, if the linear relationship held all the way to satMath = 0. This 
interpretation - while perhaps interesting - may not be very meaningful since there 
are no students enrolled at this college who are very close to the zeroth percentile. 

Interpreting the slope parameter is often more realistic and helpful. For each addi- 
tional SAT percentile score, we would expect a student to have an additional 0.0339 
points in their first-year GPA on average. We must be cautious in this interpretation: 
while there is a real association, we cannot interpret a causal connection between the 
variables. That is, increasing a student's SAT score may not cause the student's GPA 
to increase. 



Interpreting least squares estimate parameters 

The intercept describes the average outcome of y if x = 0 and the linear model 
holds. The slope describes the estimated difference in the y variable if the explana- 
tory variable (x) for a case happened to be one unit larger. 



7.2.5 Extrapolation is treacherous 

When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming 
was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have 
risen. Consider this: On February 6 th it was 10 degrees. Today it hit almost 80. At this rate, by 
August it will be 220 degrees. So clearly folks the climate debate rages on. 

Stephen Colbert 
April 6th, 2010 10 



10 http://www.colbertnation.com/the-colbert-report-videos/269929/
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Linear models are used to approximate the relationship between two variables. How- 
ever, these models have real limitations. Linear regression is simply a modeling framework. 
The truth is almost always much more complex than our simple line. For example, we do 
not know how the data outside of our limited window will behave. 

SAT math scores were used to predict freshman GPA in Section 7.2.3. We could also 
use the overall percentile as a predictor in place of just the math score: 

GPA = 0.0019 + 0.0477 * satTotal 

These data are shown in Figure 7.15. The linear model in this case was built for observations 
between the 2Q th and the 72 nd percentiles. The data meet all conditions necessary for the 
least squares regression line, so could the model safely be applied to students at the 90 th 
percentile? 

# Example 7.19 Use the model GPA = 0.0019 + 0.0477 * satTotal to estimate the 
GPA of a student who is at the 90 th percentile. 



To predict the GPA for a person with an SAT percentile of 90, we plug 90 into the 
regression equation: 

0.0019 + 0.0477 * satTotal = 0.0019 + 0.0477 * 90 = 4.29 

The model predicts a GPA score of 4.29. GPAs only go up to 4.0 (!). 

Applying a model estimate to values outside of the realm of the original data is called 
extrapolation. Generally, a linear model is only an approximation of the real relation- 
ship between two variables. If we extrapolate, we are making an unreliable bet that the 
approximate linear relationship will be valid in places where it has not been analyzed. 
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7.2.6 Using R 2 to describe the strength of a fit 

We evaluated the strength of the linear relationship between two variables earlier using
correlation, R. However, it is more common to explain the strength of a linear fit using $R^2$,
called R-squared. If we are given a linear model, we would like to describe how closely
the data cluster around the linear fit.

The $R^2$ of a linear model describes the amount of variation in the response that is
explained by the least squares line. For example, consider the SAT-GPA data, shown in
Figure 7.15. The variance of the response variable, GPA, is $s^2_{GPA} = 0.549$. However, if we
apply our least squares line, then this model reduces our uncertainty in predicting GPA
using a student's SAT score. The variability in the residuals describes how much variation
remains after using the model: $s^2_{RES} = 0.433$. In short, there was a reduction of

$\frac{s^2_{GPA} - s^2_{RES}}{s^2_{GPA}} = \frac{0.549 - 0.433}{0.549} = \frac{0.116}{0.549} = 0.21$

or about 21% in the data's variation by using information about the SAT scores via a linear
model. This corresponds exactly to the R-squared value:

R = 0.46, $R^2 = 0.21$
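
The link between the reduction in variance and the squared correlation can be checked numerically. The sketch below uses synthetic data (an assumption; the SAT-GPA observations themselves are not reproduced here) and `np.polyfit` to obtain the least squares line, then repeats the arithmetic from the text with the two reported variances.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(20, 75, 200)                 # synthetic predictor values
y = 0.05 * x + rng.normal(0, 0.6, 200)       # synthetic response with noise

b1, b0 = np.polyfit(x, y, 1)                 # least squares slope and intercept
residuals = y - (b0 + b1 * x)

var_y = np.var(y, ddof=1)                    # variance of the response
var_res = np.var(residuals, ddof=1)          # variance left over after the fit
r_squared = (var_y - var_res) / var_y        # proportional reduction in variance

r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)                     # the two values agree

# The calculation from the text, using the reported variances
text_r_squared = (0.549 - 0.433) / 0.549
print(round(text_r_squared, 2))
```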

Q Exercise 7.20 If a linear model has a very strong negative relationship with a 
correlation of -0.97, how much of the variation in the response is explained by the 
explanatory variable? Answer in the footnote 11 . 



7.3 Types of outliers in linear regression 

In this section, we identify (loose) criteria for which outliers are important and influential. 

Outliers in regression are observations that fall far from the "cloud" of points. These 
points are especially important because they can have a strong influence on the least squares 
line. 

Q Exercise 7.21 There are six plots shown in Figure 7.16 along with the least squares 
line and residual plots. For each scatterplot and residual plot pair, identify any 
obvious outliers and note how you think they influence the least squares line. Recall 
that an outlier is any point that doesn't appear to belong with the vast majority of 
the other points. Answer in the footnote 12 . 

Examine the residual plots in Figure 7.16. You will probably find that there is some 
trend in the main clouds of (3) and (4). In these cases, the outlier influenced the slope of 

11 About $R^2 = (-0.97)^2 = 0.94$ or 94% of the variation is explained by the linear model.

12 Across the top, then across the bottom: (1) There is one outlier far from the other points, though it 
only appears to slightly influence the line. (2) One outlier on the right, though it is quite close to the least 
squares line, which suggests it wasn't very influential. (3) One point is far away from the cloud, and this 
outlier appears to pull the least squares line up on the right; examine how the line around the primary 
cloud doesn't appear to fit very well. (4) There is a primary cloud and then a small secondary cloud of 
four outliers. The secondary cloud appears to be influencing the line somewhat strongly, making the least
squares line fit poorly almost everywhere. There might be an interesting explanation for the dual clouds,
which is something that could be investigated. (5) There is no obvious trend in the main cloud of points
and the outlier on the right appears to largely control the slope of the least squares line. (6) There is one 
outlier far from the cloud, however, it falls quite close to the least squares line and does not appear to be 
very influential. 
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Figure 7.16: Six plots, each with a least squares line and residual plot. All 
data sets have at least one outlier. 
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the least squares line. In (5), data with no clear trend were assigned a line with a large 
trend simply due to one outlier (!). 



Leverage 

Points that fall, horizontally, away from the center of the cloud tend to pull harder 
on the line, so we call them points with high leverage. 



Points that fall horizontally far from the center of the cloud are points of high leverage; these points
can strongly influence the slope of the least squares line. If one of these high leverage
points does appear to exert its influence on the slope of the line - as in cases (3),
(4), and (5) of Exercise 7.21 - then we call it an influential point. Usually we can say a 
point is influential if, had we fit the line without it, the influential point would have been 
unusually far from the least squares line. 
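
The effect of a single influential point can be seen by fitting the line with and without it. The sketch below is only an illustration on synthetic data (not the data behind Figure 7.16): a cloud with no real trend plus one high leverage point placed far to the right and well above the cloud.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 30)
y = 5 + rng.normal(0, 1, 30)            # main cloud: no real trend (synthetic)

x_out, y_out = 40.0, 25.0               # one point far from the cloud horizontally
x_all = np.append(x, x_out)
y_all = np.append(y, y_out)

# Least squares fits without and with the high leverage point
slope_without, intercept_without = np.polyfit(x, y, 1)
slope_with, intercept_with = np.polyfit(x_all, y_all, 1)
print("slope without the point:", slope_without)
print("slope with the point:   ", slope_with)

# How far the point would sit from the line fit without it
resid_vs_without = y_out - (intercept_without + slope_without * x_out)
print("residual relative to the line fit without it:", resid_vs_without)
```

The last quantity mirrors the rule of thumb in the text: the point counts as influential if it would have been unusually far from the line fit without it.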

It is tempting to remove outliers. Don't do this without very good reason. Models that 
ignore exceptional (and interesting) cases often perform poorly. For instance, if a financial 
firm ignored the largest market swings - the "outliers" - they would soon go bankrupt by 
making poorly thought-out investments. 



Caution: Don't ignore outliers when fitting a final model 

If there are outliers in the data, they should not be removed or ignored without 
good reason. Whatever final model is fit to the data would not be very helpful if it 
ignores the most exceptional cases. 



7.4 Inference for linear regression 

In this section we discuss uncertainty in the estimates of the slope and y-intercept for 
a regression line. Just as we identified standard errors for point estimates in previous 
chapters, we first discuss standard errors for these new estimates. However, in the case of 
regression, we will identify standard errors using statistical software. 

7.4.1 Midterm elections and unemployment 

Elections for members of the United States House of Representatives occur every two 
years, coinciding every four years with the U.S. Presidential election. The set of House 
elections occurring during the middle of a Presidential term are called midterm elections, 
and are thought to be closely linked with unemployment. In America's two-party system, 
one political theory suggests the higher the unemployment rate, the worse the President's 
party will do in the midterm elections. 

To assess the validity of this claim, we can compile historical data and look for a 
connection. We consider every midterm election from 1898 to 2010, with the exception 
of those elections during the Great Depression. Figure 7.17 shows these data and the 
least-squares regression line: 

$$\text{\% change in House seats for President's party} = -2.94 - 1.08 \times (\text{unemployment rate})$$

We consider the percent change in the number of seats of the President's party (e.g. percent 
change in the number of seats for Democrats in 2010) against the unemployment rate.

Figure 7.17: The percent change in House seats for the President's party
in each election from 1898 to 2010 plotted against the unemployment rate. 
The two points for the Great Depression have been removed, and a least 
squares regression line has been fit to the data. 
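
As a small illustration of how the fitted line is used, the sketch below plugs a few unemployment rates into the equation reported above. The function name `predicted_seat_change` and the example rates are assumptions chosen only for illustration.

```python
def predicted_seat_change(unemployment_rate):
    """Predicted % change in House seats for the President's party.

    Uses the fitted line reported in the text: -2.94 - 1.08 * (unemployment rate).
    """
    intercept = -2.94
    slope = -1.08
    return intercept + slope * unemployment_rate

# Hypothetical unemployment rates (in percent), chosen only for illustration.
for rate in (4.0, 6.0, 8.0):
    print(f"unemployment {rate:.1f}% -> predicted change {predicted_seat_change(rate):.2f}%")
```

These are point predictions only; whether the negative slope reflects a real relationship is the question taken up by the hypothesis test.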



Exercise 7.22 The data for the Great Depression (1934 and 1938) were removed
because the unemployment rate was 21% and 18%, respectively. Do you agree that
they should be removed for this investigation? Why or why not? Two considerations
in the footnote 13.

There is a negative slope in the line shown in Figure 7.17. However, this slope (and 
the y-intercept) are only estimates of the parameter values. We might wonder, is this 
convincing evidence that the "true" linear model has a negative slope? That is, do the 
data provide strong evidence that the political theory is accurate? We can frame this 
investigation into a one-sided statistical hypothesis test: 

$H_0$: $\beta_1 = 0$. The true linear model has slope zero.

$H_A$: $\beta_1 < 0$. The true linear model has a negative slope. That is, the higher the unemployment,
the greater the losses for the President's party in the House of Representatives.

We would reject $H_0$ in favor of $H_A$ if the data provide strong evidence that the true slope
parameter is less than zero. To assess the hypothesis test, we identify a standard error for 
the estimate, compute an appropriate test statistic, and identify the p-value. 

7.4.2 Understanding regression output from software 

Just like other point estimates we have seen before, we can compute a standard error for
$b_1$ and a test statistic. We will generally label the test statistic using a T, since it follows
the t distribution.

13 Each of these points would have very high leverage on any least-squares regression line, and years 
with such high unemployment may not help us understand what would happen in other years where the 
unemployment is only modestly high. On the other hand, these are exceptional cases, and we would be 
discarding important information if we exclude them from a final analysis.

Figure 7.19: The distribution shown here is the sampling distribution for
$b_1$, if the null hypothesis was true. The shaded tail represents the p-value
for the hypothesis test evaluating whether there is convincing evidence that
higher unemployment corresponds to a greater loss of House seats for the
President's party during a midterm election.



We will rely on statistical software to compute the standard error and leave the deriva- 
tion to a second or third statistics course. Table 7.18 shows software output for the least 
squares regression line in Figure 7.17. The row labeled unemp represents the information 
for the slope, which is the coefficient of the unemployment variable. 





              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)    -2.9417       9.0851     -0.32     0.7488
unemp          -1.0805       1.4513     -0.74     0.4635
                                                 df = 25

Table 7.18: Output from statistical software for the regression line modeling
the midterm election losses for the President's party as a response to
unemployment.



Example 7.23 There are two rows in Table 7.18: one for the y-intercept estimate
and one for the slope estimate. What do the first and second columns represent? 

The entries in the first column represent the least squares estimate, and the values in 
the second column correspond to the standard errors of each estimate. 

We previously used a t test statistic for hypothesis testing on small (or large!) samples. 
Regression is very similar. In the hypotheses we consider, the null value for the slope is 0, 
so we can compute the test statistic using the T (or Z) score formula: 

$$T = \frac{\text{estimate} - \text{null value}}{SE} = \frac{-1.0805 - 0}{1.4513} = -0.74$$



We can look for the one-sided p-value - shown in Figure 7.19 - using the probability table
for the t distribution in Appendix C.2 on page 364.

Example 7.24 Table 7.18 offers the degrees of freedom for the test statistic T:
df = 25. Identify the p-value for the hypothesis test. 

Looking in the 25 degrees of freedom row in Appendix C.2, we see that the absolute 
value of the test statistic is smaller than any value listed, which means the tail area 
and therefore also the p-value is larger than 0.100 (one tail!). Because the p-value 
is so large, we fail to reject the null hypothesis. That is, the data do not provide 
convincing evidence that a higher unemployment rate tends to correspond to larger 
losses for the President's party in the House of Representatives in midterm elections. 
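
The table lookup in Example 7.24 can also be checked numerically. The sketch below recomputes T from the unemp row of Table 7.18 and approximates the lower tail area of the t distribution with Simpson's rule. The helper names `t_pdf` and `t_lower_tail` are introduced here for illustration, and the numerical integration is a stand-in for the table in Appendix C.2, not the method the software uses.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_lower_tail(t_stat, df, steps=2000):
    """P(T <= t_stat), using Simpson's rule on [0, |t_stat|] and symmetry."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 == 1 else 2
        total += weight * t_pdf(i * h, df)
    central_area = total * h / 3  # area between 0 and |t_stat|
    return 0.5 - central_area if t_stat < 0 else 0.5 + central_area

estimate, null_value, se, df = -1.0805, 0.0, 1.4513, 25  # from Table 7.18
t_stat = (estimate - null_value) / se
p_one_sided = t_lower_tail(t_stat, df)  # H_A: slope < 0, so use the lower tail

print(f"T = {t_stat:.2f}, one-sided p-value = {p_one_sided:.3f}")
```

The resulting tail area is well above 0.100, which agrees with the conclusion drawn from the table.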

We could have identified the t test statistic from the software output in Table 7.18,
shown in the second row (unemp) and third column under (t value). The entry in the second
row and last column in Table 7.18 represents the p-value for the two-sided hypothesis test 
where the null value is zero. Under close examination, we can see a null hypothesis of 0 
was also used to compute the p-value in the (Intercept) row. However, we are more often 
interested in evaluating the significance of the slope since the slope represents a connection 
between the two variables while the intercept represents an estimate of the average outcome 
if the x-value was zero. 



Inference for regression 

We usually rely on statistical software to identify point estimates and standard 
errors for parameters of a regression line. After verifying conditions hold for fitting 
a line, we can use the methods learned in Section 6.1 for the t distribution to create 
confidence intervals for regression parameters or to evaluate hypothesis tests. 
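
As a sketch of the confidence interval calculation mentioned in this box, the code below forms estimate ± t* × SE for the slope in Table 7.18. The function name `slope_confidence_interval` is hypothetical, and the cutoff t* = 2.06 (the approximate t table value for a 95% interval with 25 degrees of freedom) is an assumption made for this example rather than a value quoted in the text.

```python
def slope_confidence_interval(estimate, se, t_star):
    """Return (lower, upper) for estimate +/- t_star * SE."""
    margin = t_star * se
    return estimate - margin, estimate + margin

# Estimate and SE from the unemp row of Table 7.18; t_star = 2.06 is the
# approximate t table cutoff for a 95% interval with df = 25 (an assumption
# made for this sketch, not a value quoted in the text).
lower, upper = slope_confidence_interval(-1.0805, 1.4513, t_star=2.06)
print(f"95% CI for the slope: ({lower:.2f}, {upper:.2f})")
```

The interval contains zero, which is consistent with failing to reject the null hypothesis for these data.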



Caution: Don't carelessly use the p-value from regression output 

The last column in regression output is often used to list p-values for one particular
hypothesis: a two-sided test where the null value is zero. If your test is one-sided
and the point estimate is in the direction of $H_A$, then you can halve the software's
p-value to get the one-tail area. If neither of these scenarios matches your hypothesis
test, be cautious about using the software output to obtain the p-value.
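
The halving rule can be written as a small helper. This is a sketch under the stated assumptions (null value zero, symmetric t distribution); the function name `one_sided_p_value` is hypothetical. The branch for an estimate that falls opposite to the alternative (one minus half the two-sided value) is a standard consequence of symmetry that goes slightly beyond the rule stated in the box.

```python
def one_sided_p_value(two_sided_p, estimate, alternative):
    """Convert a two-sided p-value (null value 0) into a one-sided p-value.

    alternative is "less" (H_A: parameter < 0) or "greater" (H_A: parameter > 0).
    """
    if alternative not in ("less", "greater"):
        raise ValueError("alternative must be 'less' or 'greater'")
    in_direction_of_ha = (estimate < 0) if alternative == "less" else (estimate > 0)
    half = two_sided_p / 2
    return half if in_direction_of_ha else 1 - half

# Values from the unemp row of Table 7.18, with H_A: slope < 0.
p = one_sided_p_value(0.4635, estimate=-1.0805, alternative="less")
print(f"one-sided p-value: {p:.4f}")
```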



Example 7.25 Examine Figure 7.15 on page 287, which relates freshman GPA and
SAT scores. How sure are you that the slope is statistically significantly different 
from zero? That is, do you think a formal hypothesis test would reject the claim that 
the true slope of the line should be zero? Why or why not? 

While the relationship between the variables is not perfect, it is difficult to deny that 
there is some increasing trend in the data. This suggests the hypothesis test will 
reject the null claim that the slope is zero. 

Exercise 7.26 Table 7.20 shows statistical software output from fitting the least
squares regression line shown in Figure 7.15. Use this output to formally evaluate the
following hypotheses. $H_0$: The true coefficient for satTotal is zero. $H_A$: The true
coefficient for satTotal is not zero. Answer in the footnote 14.

14 We look in the second row corresponding to the satTotal variable. We see the point estimate of the 
slope of the line is 0.0477, the standard error of this estimate is 0.0029, and the t test statistic is 16.38. The 
p-value corresponds exactly to the two-sided test we are interested in: 0.0000. This output doesn't mean 
the p-value is exactly zero, only that when rounded to four decimal places it is zero. That is, the p-value 
is so small that we can reject the null hypothesis and conclude that GPA and SAT scores are positively 
correlated and the true slope parameter is indeed greater than 0, just as we believed in Example 7.25.

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     0.0019       0.1520      0.01     0.9899
satTotal        0.0477       0.0029     16.38     0.0000

Table 7.20: Summary of least squares fit for the SAT/GPA data.



TIP: Always check assumptions 

If conditions for fitting the regression line do not hold, then the methods presented 
here should not be applied. The standard error or distribution assumption of the 
point estimate - assumed to be normal when applying the t test statistic - may 
not be valid. 



7.4.3 An alternative test statistic 

We considered the t test statistic as a way to evaluate the strength of evidence for a 
hypothesis test in Section 7.4.2. However, we could focus on $R^2$. Recall that $R^2$ described
the proportion of variability in the response variable (y) explained by the explanatory 
variable (x). If this proportion is large, then this suggests a linear relationship exists 
between the variables. If this proportion is small, then the evidence provided by the data 
may not be convincing. 
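
To make the "proportion of variability explained" concrete, the sketch below computes $R^2$ for a least squares line as one minus the ratio of residual variability to total variability, and compares it with the squared correlation. The five data points are hypothetical and serve only to illustrate the arithmetic.

```python
import math

# Hypothetical data, chosen only to illustrate the calculation.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variability in y
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = s_xy / s_xx          # least squares slope
b0 = y_bar - b1 * x_bar   # least squares intercept
ss_residual = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - ss_residual / s_yy  # proportion of variability explained
r = s_xy / math.sqrt(s_xx * s_yy)   # correlation

print(f"R^2 = {r_squared:.2f}, R = {r:.2f}")
```

For a line fit by least squares with a single predictor, the two quantities agree: the value of `r_squared` equals `r ** 2`.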

This concept - considering the amount of variability in the response variable explained 
by the explanatory variable - is a key component in some statistical techniques. A method 
called analysis of variance (ANOVA) relies on this general principle and is a common 
topic in statistics. The method states that if enough variability is explained away by the 
explanatory variable, then we would conclude the variables are connected. On the other 
hand, we might not be convinced if only a little variability is explained. We will discuss 
this method further in Section 8.4.

7.5 Exercises



7.5.1 Line fitting, residuals, and correlation 

7.1 The scatterplots shown below each have a superimposed regression line. If we were 
to construct a residual plot (residuals versus x) for each, what would those plots look like? 




7.2 Shown below are two plots of residuals remaining after fitting a linear model to two 
different sets of data. Describe any apparent trends in these plots and determine if a linear 
model would be appropriate for these data. Explain your reasoning. 



[Figure: two residual plots.]




7.3 For each of the six plots, identify the strength of the relationship (e.g. weak, moderate, 
or strong) in the data and whether fitting a linear model would be reasonable. 




[Figure: six scatterplots, labeled (1) through (6).]

7.4 For each of the six plots, identify the strength of the relationship (e.g. weak, moderate,
or strong) in the data and whether fitting a linear model would be reasonable. 




7.5 Below are two scatterplots based on grades recorded during several years for a Statis- 
tics course at a university. The first plot shows a scatterplot for final exam grade versus 
first exam grade for 233 students, and the second scatterplot shows the final exam grade 
versus second exam grade. 



(a) Based on these graphs, which of the two exams has the strongest correlation with the 
final exam grade? Explain. 

(b) Can you think of a reason why the correlation between the exam you chose in part (a) 
and the final exam is higher?

7.6 The Great Britain Office of Population Census and Surveys once collected data on a
random sample of 170 married couples in Britain, recording the age (in years) and heights
(converted here to inches) of the husbands and wives [44]. The scatterplot on the left shows
the wife's age plotted against her husband's age, and the plot on the right shows wife's
height plotted against husband's height.

[Figure: two scatterplots. Left: wife's age against husband's age. Right: wife's height against husband's height (in inches).]



(a) Describe the relationship between husbands' and wives' ages. 

(b) Describe the relationship between husbands' and wives' heights. 

(c) Which plot shows a stronger correlation? Explain your reasoning. 

(d) Data on heights were originally collected in centimeters, and then converted to inches. 
Does this conversion affect the correlation between husbands' and wives' heights? 



7.7 Match the calculated correlations to the corresponding scatterplot.

(a) R = -0.73

(b) R = 0.35

(c) R = -0.02

(d) R = 0.92

[Figure: four scatterplots, labeled (1) through (4).]

7.8 Match the calculated correlations to the corresponding scatterplot.

(a) R = 0.35

(b) R = -0.52

(c) R = -0.06

(d) R = -0.80

[Figure: four scatterplots, labeled (1) through (4).]

7.9 63 college students were asked to fill out a survey where they were asked about their 
height, fastest speed they ever drove at, and gender. Below is a scatterplot displaying the 
relationship between height and fastest speed.

(a) Describe the relationship between height and fastest speed.

(b) Why do you think these variables are positively associated? 

(c) Below is another scatterplot displaying the relationship between height and fastest 
speed. In this plot, female students are represented with triangles and male students 
are represented with circles. How does gender play a role in the relationship between
height and fastest speed?



[Figure: scatterplot of fastest speed against height (in inches), with female students shown as triangles and male students as circles.]

7.10 The scatterplots below show the relationship between height, diameter, and volume 
of timber in 31 felled black cherry trees. Note that the diameter of the tree is measured at 
4 feet 6 inches above the ground. [45]

(a) Describe the relationship between volume and height of these trees.

(b) Describe the relationship between volume and diameter of these trees. 

(c) Suppose you have height and diameter measurements for another black cherry tree. 
Which of these variables would you prefer to use to predict the volume of timber in 
this tree using a simple linear regression model? Explain your reasoning.

7.11 The Coast Starlight Amtrak train runs from Seattle to Los Angeles. The scatterplot
below displays the distance between each stop (in miles) and the amount of time it takes 
to travel from one stop to another (in minutes). 



(a) Describe the relationship between distance and travel time.

(b) How would the relationship change if travel time was instead measured in
hours, and distance was instead measured in kilometers?

(c) Correlation between travel time (in minutes) and distance (in miles) is R = 0.636.
What is the correlation between travel time (in hours) and distance (in kilometers)?



[Figure: scatterplot of travel time (in minutes) against distance (in miles).]



7.12 A study conducted at University of Denver investigated whether babies take longer 
to learn to crawl in cold months, when they are often bundled in clothes that restrict their 
movement, than in warmer months [46]. Infants born during the study year were split into 
twelve groups, one for each birth month. We consider the average crawling age of babies 
in each group against the average temperature when the babies are six months old (that's 
when babies often begin trying to crawl). Temperature is measured in degrees Fahrenheit 
(°F) and age is measured in weeks. 




[Figure: scatterplot of average crawling age (in weeks) against temperature (in °F).]



(a) Describe the relationship between temperature and crawling age. 

(b) How would the relationship change if temperature was measured in degrees Celsius 
(°C) and age was measured in months instead of in weeks? 

(c) The correlation between temperature in Fahrenheit and age in weeks was R = -0.70. If
we converted the temperature to Celsius and age to months, what would the correlation
be?

7.13 Researchers studying anthropometry collected body girth measurements and skeletal
diameter measurements, as well as age, weight, height and gender for 507 physically active 
individuals [21]. The scatterplot below shows the relationship between height and shoulder 
girth (over deltoid muscles), both measured in centimeters. 

(a) Describe the relationship between shoulder girth and height. 

(b) How would the relationship change if shoulder girth was measured in inches while the 
units of height remained in centimeters? 

[Figure: scatterplot of height (in cm) against shoulder girth (in cm).]



7.14 The scatterplot below shows the relationship between weight measured in kilograms 
and hip girth measured in centimeters from the data described in Exercise 7.13. 



(a) Describe the relationship between hip girth and weight.

(b) How would the relationship change if weight was measured in pounds while
the units for hip girth remained in centimeters?

7.15 What would be the correlation between the ages of husbands and wives if men always
married women who were

(a) 3 years younger than themselves? 

(b) 2 years older than themselves? 

(c) half as old as themselves?

7.16 What would be the correlation between the annual salaries of males and females at
a company if for a certain type of position men always made 

(a) $5,000 more than women? 

(b) 25% more than women? 

(c) 15% less than women? 



7.5.2 Fitting a line by least squares regression 

7.17 The Association of Turkish Travel Agencies reports the number of foreign tourists 
visiting Turkey and tourist spending by year [47]. The scatterplot below shows the rela- 
tionship between these two variables along with the least squares fit. 



(a) Describe the relationship between number of tourists and spending.

(b) What are the explanatory and response variables?

(c) Why might we want to fit a regression line to these data?

7.18 The scatterplot below shows the relationship between number of calories and amount
of carbohydrates (in grams) Starbucks food menu items contain [48]. Since Starbucks only
lists the number of calories on the display items, we are interested in predicting the amount 
of carbs a menu item has based on its calorie content. 



(a) Describe the relationship between number of calories and amount of carbohydrates
(in grams) Starbucks food menu items contain.

(b) In this scenario, what are the explanatory and response variables?

(c) Why might we want to fit a regression line to these data?

7.19 Does the tourism data plotted in Exercise 7.17 meet the conditions required for
fitting a least squares line? In addition to the scatterplot provided in Exercise 7.17, use 
the residuals plot and the histogram below to answer this question.

7.20 Does the Starbucks nutrition data plotted in Exercise 7.18 meet the conditions
required for fitting a least squares line? In addition to the scatterplot provided in
Exercise 7.18, use the residuals plot and the histogram below to answer this question.

7.21 Exercise 7.11 introduces data on the Coast Starlight Amtrak train that runs from
Seattle to Los Angeles. The mean travel time from one stop to the next on the Coast 
Starlight is 129 mins, with a standard deviation of 113 minutes. The mean distance traveled 
from one stop to the next is 107 miles with a standard deviation of 99 miles. The correlation 
between travel time and distance is 0.636. 

(a) Write the equation of the regression line for predicting travel time. 

(b) Interpret the slope and the intercept in context. 

(c) The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to 
estimate the time it takes for the Starlight to travel between these two cities. 

(d) It actually takes the Coast Starlight 168 mins to travel from Santa Barbara to Los
Angeles. Calculate the residual and explain the meaning of this residual value.

(e) Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away
from Los Angeles. Would it be appropriate to use this linear model to predict the
travel time from Los Angeles to this point?

7.22 Exercise 7.13 introduces data on shoulder girth and height of a group of individuals.
The mean shoulder girth is 108.20 cm with a standard deviation of 10.37 cm. The mean 
height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height 
and shoulder girth is 0.666. 

(a) Write the equation of the regression line for predicting height. 

(b) Interpret the slope and the intercept in context. 

(c) A randomly selected student from your class has a shoulder girth of 100 cm. Predict 
the height of this student. 

(d) This student is actually 160 cm tall. Calculate the residual and explain what this 
residual means. 

(e) A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear 
model to predict the height of this child? 

7.23 Based on the information given in Exercise 7.21, calculate $R^2$ of the regression line
for predicting travel time from distance traveled for the Coast Starlight and interpret it in
context.

7.24 Based on the information given in Exercise 7.22, calculate $R^2$ of the regression line
for predicting height from shoulder girth and interpret it in context.

7.25 Data were collected on the number of hours 
per week students watch TV and the grade they 
earned in a statistics class (out of 100). Based on 
the scatterplot and the residuals plot provided, de- 
scribe the relationship between the two variables 
and determine if a simple linear model is appro- 
priate to predict grade from the number of hours 
per week the student watches TV. 



7.26 Exercise 7.18 introduces a data set on nu- 
trition information on Starbucks food menu items. 
Based on the scatterplot and the residuals plot 
provided, describe the relationship between the 
protein content and calories of these menu items 
and determine if a simple linear model is appro- 
priate to predict protein amount from the number 
of calories.

7.5.3 Types of outliers in linear regression

7.27 Identify the outliers in the scatterplots shown below and determine what type of 
outliers they are. Explain your reasoning. 




(a) (b) (c) 



7.28 Identify the outliers in the scatterplots shown below and determine what type of 
outliers they are. Explain your reasoning. 




(a) (b) (c) 



7.29 Exercise 7.12 introduces data on the average monthly temperature during the month 
babies first try to crawl (about 6 months after birth) and the average first crawling age 
for babies born in a given month. A scatterplot of these two variables reveals an outlying 
month when the average temperature is about 53 degrees Fahrenheit and average crawling 
age is about 28.5 weeks. What type of an outlier is this month? Explain your reasoning. 

7.30 The scatterplot below shows the percent of families who own their home vs. the 
percent of the population living in urban areas in 2000 [49]. There are 52 observations, 
each corresponding to a state in the US (including Puerto Rico and District of Columbia). 



(a) Describe the relationship between the 
percent of families who own their home 
and the percent of the population living 
in urban areas in 2000. 

(b) The outlier at the bottom right corner is 
District of Columbia, where 100% of the 
population is considered urban. What 
type of an outlier is this observation?

7.5.4 Inference for linear regression



7.31 How well does the number of beers a student drinks predict his or her blood alcohol 
content? Sixteen student volunteers at Ohio State University drank a randomly assigned 
number of cans of beer. Thirty minutes later, a police officer measured their blood alcohol 
content (BAC) in grams of alcohol per deciliter of blood [50]. Note that in all states of the 
U.S., the legal BAC limit is 0.08. In this experiment, the students were equally divided 
between men and women and differed in weight and usual drinking habits. Because of this 
variation, many students don't believe that the number of drinks predicts BAC well. In 
this problem, we examine how well the number of drinks predicts BAC when there are no 
other predictors available. (Note: each person's tolerance is different, and you should not
rely on these data to predict your BAC with high accuracy!) Given below is a scatterplot
displaying the relationship between BAC and number of cans of beer as well as a summary
output of the least squares fit for these data.

(a) Describe the relationship between number of cans of beer and BAC.

(b) Write the equation of the regression line. Interpret the slope and intercept in context. 

(c) Do the data provide strong evidence that drinking more beers is associated with an
increase in blood alcohol? State the null and alternative hypotheses, report the p-value,
and state your conclusion.

(d) The correlation coefficient for number of cans of beer and BAC is 0.89. Calculate $R^2$
and interpret it in context.



7.32 The scatterplot and least squares summary below show the relationship between 
weight measured in kilograms and height measured in centimeters based on data discussed 
in Exercise 7.13.

(a) Describe the relationship between height and weight.

(b) Write the equation of the regression line. Interpret the slope and intercept in context. 

(c) Do the data provide strong evidence that an increase in height is associated with an 
increase in weight? State the null and alternative hypotheses, report the p-value, and 
state your conclusion. 

(d) The correlation coefficient for height and weight is 0.72. Calculate $R^2$ and interpret it
in context.

7.33 Exercise 7.6 presents a scatterplot displaying the relationship between husbands' 
and wives' ages in a random sample of 170 married couples in Britain. Given below is a 
summary output of the least squares fit for predicting wife's age from husband's age. 





              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     1.5740

(a) Is there a statistically significant linear relationship between husbands' and wives'
ages? State the hypotheses and include any information used to conduct the test.

(b) Write the equation of the regression line for predicting wife's age from husband's age. 

(c) Interpret the slope and intercept in context. 

7.34 Exercise 7.6 presents a scatterplot displaying the relationship between husbands' 
and wives' heights in a random sample of 170 married couples in Britain. Given below is a 
summary output of the least squares fit for predicting wife's height from husband's height. 





              Estimate   Std. Error   t value

(a) Is there strong evidence that taller men marry taller women? State the hypotheses and
include any information used to conduct the test.

(b) Write the equation of the regression line for predicting wife's height from husband's height.

(c) Interpret the slope and intercept in context. 

7.35 Exercise 7.33 provides a summary output of the least squares fit for predicting wife's 
age from husband's age. 

(a) Given that $R^2 = 0.88$, what is the correlation of husband and wife ages in this data
set?

(b) You meet a married man from Britain not included in the sample of 170 couples but
who comes from the population that was sampled for this study and who is 55 years
old. What would you predict his wife's age to be? How reliable is this prediction?

(c) You meet another married man from Britain not included in the sample of 170 couples
but who comes from the population that was sampled for this study and who is 85
years old. Would it be wise to use the same linear model to predict his wife's age?
Why or why not?

7.36 Exercises 7.6 and 7.34 provide a summary plot and a regression summary table for
predicting wife's height from husband's height. 

(a) Given that $R^2 = 0.09$, what is the correlation of husband and wife height in this data
set?

(b) You meet a married man from Britain not included in the sample of 170 couples but who 
comes from the population that was sampled for this study and who is 5'9" (69 inches). 
What would you predict his wife's height to be? How reliable is this prediction? 

(c) You meet another married man from Britain not included in the sample of 170 couples 
but who comes from the population that was sampled for this study and who is 6'7" 
(79 inches). Would it be wise to use the same linear model to predict his wife's height? 
Why or why not? 

7.37 Exercise 7.30 gives a scatterplot displaying the relationship between the percent of
families that own their home and the percent of the population living in urban areas.
Below is a summary of a least squares line for these data,
excluding District of Columbia. There were 51 cases.



(a) For these data, $R^2 = 0.28$. What is the correlation? How can you tell if it is positive
or negative?

(b) Test the hypothesis that there is no linear association between percent home ownership 
and percent of the population living in an urban setting. State the hypotheses and 
include any information used to conduct the test. 

(c) Calculate the predicted percent home ownership for populations where 40% and
80% live in an urban setting. Use these two values to sketch the regression line on the 
scatterplot. State whether you believe a simple least squares line adequately fits these 
data. 

(d) The residual plot for this regression is given below. How would you describe the trend 
visible in this plot? Based on this plot, should a simple least squares line be fit to these 
data?

The principles of simple linear regression with one numerical predictor and one numerical
response lay the foundation for more sophisticated regression methods used in a wide range 
of challenging settings. In Chapter 8, we explore multiple regression, which introduces 
the possibility of more than one predictor. We will also consider methods for analysis of 
variance (ANOVA), a tool useful both in practice and when learning about the mechanics 
of regression. 

8.1 Introduction to multiple regression 

Multiple regression extends the simple bivariate regression (two variables: x and y) to the 
case that still has one response but may have many predictors (denoted $x_1, x_2, x_3, \ldots$). The 
method is motivated by scenarios where many variables may be simultaneously connected 
to an output. 

We will consider eBay auctions of a video game called Mario Kart for the Nintendo 
Wii. The outcome variable of interest is the total price of an auction: the highest bid plus 
the shipping cost. But how is the total price related to characteristics of an auction? For 
instance, are longer auctions associated with higher or lower prices? And, on average, 
how much more do buyers tend to pay for additional Wii wheels (plastic steering wheels 
that attach to the Wii controller) in auctions? Multiple regression will help us answer these 
and other questions. 

The data set marioKart includes results from 143 auctions 1 . Four observations from 
this data set are shown in Table 8.1, and descriptions for each variable are shown in Ta- 
ble 8.2. 

8.1.1 Using categorical variables with two levels as predictors 

There are two predictor variables in the marioKart data set that are inherently categorical: 
a variable describing the condition of the game and the variable describing whether a stock 
photo was used for the auction. Two-level categorical variables are often coded using 0's 

1 Diez DM, Barr CD, and Cetinkaya M. 2011. openintro: OpenIntro data sets and supplemental 
functions. R package Version 1.2. 

       totalPr   condNew   stockPhoto   duration   wheels
1        51.55         1            1          3        1
2        37.04         0            1          7        1
...
142      38.76         0            0          7        0
143      54.51         1            1          1        2

Table 8.1: Four observations from the marioKart data set. 



variable      description

totalPr       the total of the final auction price and the shipping cost, in US
              dollars
condNew       a coded two-level categorical variable, which takes value 1 when the
              game is new and 0 if the game is used
stockPhoto    a coded two-level categorical variable, which takes value 1 if the
              primary photo used in the auction was a stock photo and 0 if the
              photo was unique to that auction
duration      the length of the auction, in days
wheels        the number of Wii wheels included with the auction (a Wii wheel
              is a plastic racing wheel that holds the Wii controller and is an
              optional but helpful accessory for playing Mario Kart Wii)

Table 8.2: Variables and their descriptions for the marioKart data set. 

and 1's, which allows them to be incorporated into a regression model in the same way as 
a numerical predictor: 

$$\widehat{\text{totalPr}} = \beta_0 + \beta_1 \times \text{condNew}$$

If we fit this model for total price and game condition using simple linear regression, we 
obtain the following regression line estimate: 

$$\widehat{\text{totalPr}} = 42.87 + 10.90 \times \text{condNew} \tag{8.1}$$

The 0-1 coding of the two-level categorical variable allows for a simple interpretation of 
the coefficient of condNew. When the game is in used condition, the condNew variable 
takes a value of zero, and the total auction price predicted from the model would be 
$42.87 + $10.90 * (0) = $42.87. If the game is in new condition, then the condNew variable 
takes value one and the total price is predicted to be $42.87 + $10.90 * (1) = $53.77. We 
now see clearly that the coefficient of condNew estimates the difference ($10.90) in the total 
auction price when the game is new ($53.77) versus used ($42.87). 



TIP: The coefficient of a two-level categorical variable 

The coefficient of a binary variable corresponds to the estimated difference in the 
outcome between the two levels of the variable. 
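
As a quick numeric check of this tip, the sketch below fits a one-predictor least squares line to only the four auctions listed in Table 8.1 and compares the slope with the difference in group means. Because it uses four rows rather than the full data set, its numbers will not match Equation (8.1); the helper name `fit_line` is illustrative and not part of the openintro package.

```python
# Four auctions from Table 8.1 (not the full marioKart data set).
cond_new = [1, 0, 0, 1]
total_pr = [51.55, 37.04, 38.76, 54.51]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(cond_new, total_pr)

# Group means computed directly from the same four rows.
mean_new = sum(p for c, p in zip(cond_new, total_pr) if c == 1) / cond_new.count(1)
mean_used = sum(p for c, p in zip(cond_new, total_pr) if c == 0) / cond_new.count(0)

print(f"intercept = {intercept:.2f}, mean of used games = {mean_used:.2f}")
print(f"slope = {slope:.2f}, new minus used = {mean_new - mean_used:.2f}")
```

With 0-1 coding, the intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means, which is what the two printed lines compare.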

Exercise 8.2 The best fitting linear model for the outcome totalPr and predictor 
stockPhoto is 

$$\widehat{\text{totalPr}} = 44.33 + 4.17 \times \text{stockPhoto} \tag{8.3}$$



where the variable stockPhoto takes value 1 when a stock photo is being used and 0 
when the photo is unique to that auction. Interpret the coefficient of stockPhoto. 






Example 8.4 In Exercise 8.2, you found that auctions whose primary photo was a 
stock photo tended to sell for about $4.17 more than auctions that feature a unique 
photo. Suppose a seller learns this and decides to change her Mario Kart Wii auction 
to have its primary photo be a stock photo. Will modifying her auction in this way 
earn her, on average, an additional $4.17? 



No, we cannot infer a causal relationship. It might be that there are inherent differ- 
ences in auctions that use stock photos and those that do not. For instance, if we 
sorted through the data, we would actually notice that many of the auctions with 
stock photos tended to also include more Wii wheels. In this case, Wii wheels is a 
potential lurking variable. 



8.1.2 Including and assessing many variables in a model 

Sometimes there are underlying structures or relationships among the predictor variables. 
For instance, new games sold on eBay tend to come with more Wii wheels, which may have 
led to higher prices for those auctions. We would like to fit a model that includes all 
potentially important variables simultaneously, which would help us evaluate the relationship 
between a predictor variable and the outcome while controlling for the potential influence 
of other variables. This is the strategy used in multiple regression. While we remain 
cautious about making any causal interpretations using multiple regression, such models 
are a common first step in providing evidence of a causal connection. 

Earlier we had constructed a simple linear model using condNew as a predictor and 
totalPr as the outcome. We also constructed a separate model using only stockPhoto as 
a predictor. Next, we want a model that uses both of these variables simultaneously and, 
while we're at it, we'll include the duration and wheels variables described in Table 8.2: 

$$\begin{aligned}
\widehat{\text{totalPr}} &= \beta_0 + \beta_1 \times \text{condNew} + \beta_2 \times \text{stockPhoto} + \beta_3 \times \text{duration} + \beta_4 \times \text{wheels} \\
\hat{y} &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4
\end{aligned} \tag{8.5}$$

where $y$ represents the total price, $x_1$ is the game's condition, $x_2$ is whether a stock photo 
was used, $x_3$ is the duration of the auction, and $x_4$ is the number of Wii wheels included 
with the game. Just as with the single predictor case, this model may be missing important 
components or it might not properly represent the relationship between the total price and 
the available explanatory variables. However, while no model is perfect, we wish to explore 
the possibility that this one may fit the data reasonably well. 

We estimate the parameters $\beta_0, \beta_1, \ldots, \beta_4$ in the same way as we did in the case of a 
single predictor. We select $b_0, b_1, \ldots, b_4$ that minimize the sum of the squared residuals: 

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$


We typically use a computer to minimize this sum and compute point estimates, as shown 
in the sample output in Table 8.3. Using this output, we identify the point estimates $b_i$ of 
each $\beta_i$, just as we did in the one-predictor case. 

               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     36.2110       1.5140     23.92     0.0000
condNew          5.1306       1.0511      4.88     0.0000
stockPhoto       1.0803       1.0568      1.02     0.3085
duration        -0.0268       0.1904     -0.14     0.8882
wheels           7.2852       0.5547     13.13     0.0000
                                                 df = 136



Table 8.3: Output for the regression model where totalPr is the outcome 
and condNew, stockPhoto, duration, and wheels are the predictors. 
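
To make "use a computer to minimize this sum" concrete, here is a minimal sketch that solves the least squares normal equations in plain Python. It is not how the output in Table 8.3 was produced; the function names `solve` and `least_squares` and the six auctions below are hypothetical, with prices built from made-up coefficients (36, 5, and 7) so that the fit can be checked exactly.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def least_squares(rows, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical auctions: (condNew, wheels) pairs, not rows of marioKart.
rows = [(1, 1), (0, 1), (0, 0), (1, 2), (0, 2), (1, 0)]
prices = [36 + 5 * c + 7 * w for c, w in rows]

b0, b1, b2 = least_squares(rows, prices)
fitted = [b0 + b1 * c + b2 * w for c, w in rows]
sse = sum((y - f) ** 2 for y, f in zip(prices, fitted))
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}, SSE = {sse:.6f}")
```

In practice a statistics package would also report standard errors and p-values; this sketch recovers only the point estimates.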



Multiple regression model 

A multiple regression model is a linear model with many predictors. In general, 
we write the model as 

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

when there are $p$ predictors. We often estimate the $\beta_i$ parameters using a computer. 



Exercise 8.7 Write out the model in Equation (8.5) using the point estimates from 
Table 8.3. How many predictors are there in this model? Answers in footnote 2. 

Exercise 8.8 What does $\beta_4$, the coefficient of variable $x_4$ (Wii wheels), represent? 
What is the point estimate of $\beta_4$? Answers in footnote 3. 

Exercise 8.9 Compute the residual of the first observation in Table 8.1. 
Hint: use the equation from Exercise 8.7. Answer in footnote 4. 

Example 8.10 The coefficients for $x_1$ (condNew) and $x_2$ (stockPhoto) are different 
than in the two simple linear models shown in Equations (8.1) and (8.3). Why might 
there be a difference? 

If we examined the data carefully, we would see that some predictors are correlated. 
For instance, when we estimated the connection of the outcome totalPr and predictor 
stockPhoto using simple linear regression, we were unable to control for other vari- 
ables like condNew. That model was biased by the lurking variable condNew. When 
we use both variables, this particular underlying and unintentional bias is reduced or 
eliminated (though bias from other lurking variables may still remain). 

Example 8.10 describes a common issue in multiple regression: correlation among 
predictor variables. We say the two predictor variables are collinear when they are corre- 
lated, and this collinearity complicates model estimation. While it is impossible to prevent 
collinearity from arising in observational data, experiments are usually designed to prevent 
predictors from being correlated. 

2 $\hat{y} = 36.21 + 5.13x_1 + 1.08x_2 - 0.03x_3 + 7.29x_4$, and there are $p = 4$ predictor variables. 

3 It is the average difference in auction price for each additional Wii wheel included when holding the 
other variables constant. The point estimate is $b_4 = 7.29$. 

4 $e_i = y_i - \hat{y}_i = 51.55 - 49.62 = 1.93$, where 49.62 was computed using the predictor values for the 
observation and the equation identified in Exercise 8.7. 
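
The arithmetic behind the fitted value and residual for the first auction can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout is an illustrative choice, the coefficients are the rounded point estimates for the full model, and the observation is the first row of Table 8.1.

```python
# Rounded point estimates for the full model.
coefficients = {
    "condNew": 5.13,
    "stockPhoto": 1.08,
    "duration": -0.03,
    "wheels": 7.29,
}
intercept = 36.21

# First observation in Table 8.1.
auction = {"totalPr": 51.55, "condNew": 1, "stockPhoto": 1, "duration": 3, "wheels": 1}

# Fitted value: intercept plus each coefficient times its predictor value.
fitted = intercept + sum(b * auction[name] for name, b in coefficients.items())

# Residual: observed outcome minus fitted value.
residual = auction["totalPr"] - fitted

print(f"fitted = {fitted:.2f}, residual = {residual:.2f}")
```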

8.1.3 Adjusted $R^2$ as a better estimate of explained variance 

We first used $R^2$ in Section 7.2 to determine the amount of variability in the response that 
was explained by the model: 

$$R^2 = 1 - \frac{\text{variability in residuals}}{\text{variability in the outcome}} = 1 - \frac{Var(e_i)}{Var(y_i)}$$

where $e_i$ represents the residuals of the model and $y_i$ the outcomes. This equation remains 
valid in the multiple regression framework, but a small enhancement can often be even 
more informative. 

Exercise 8.11 The variance of the residuals for the model given in Exercise 8.9 is 
23.34, and the variance of the total price in all the auctions is 83.06. Verify the $R^2$ 
for this model is 0.719. 

This strategy for estimating $R^2$ is okay when there is just a single variable. However, 
it becomes less helpful when there are many variables. The regular $R^2$ is actually a biased 
estimate of the amount of variability explained by the model. To get a better estimate, we 
use the adjusted $R^2$. 

Adjusted $R^2$ as a tool for model assessment 
The adjusted $R^2$ is computed as 

$$R^2_{adj} = 1 - \frac{Var(e_i)/(n - p - 1)}{Var(y_i)/(n - 1)} = 1 - \frac{Var(e_i)}{Var(y_i)} \times \frac{n - 1}{n - p - 1}$$

where $n$ is the number of cases used to fit the model and $p$ is the number of 
predictor variables in the model. 

Because $p$ is never negative, the adjusted $R^2$ will be smaller, often just a little 
smaller, than the unadjusted $R^2$. The reasoning behind the adjusted $R^2$ lies with the 
degrees of freedom associated with each variance (see footnote 5). 

Exercise 8.12 There were $n = 141$ auctions in the marioKart data set and $p = 4$ 
predictor variables in the model. Use $n$, $p$, and the variances from Exercise 8.11 to 
verify $R^2_{adj} = 0.711$ for the Mario Kart model. 

Exercise 8.13 Suppose you added another predictor to the model, but the variance 
of the errors $Var(e_i)$ didn't go down. What would happen to the $R^2$? What would 
happen to the adjusted $R^2$? Answers in footnote 6. 

The idea that a predictor that doesn't explain any extra variance would actually "hurt" 
the adjusted $R^2$ highlights a common sentiment in statistics: avoid making a model more 
complicated than it needs to be. 
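
The two formulas in this section can be checked against the numbers quoted in Exercises 8.11 and 8.12. The sketch below is only that, a check: the function names are illustrative, and the last line explores the scenario of Exercise 8.13 by raising $p$ to 5 while leaving the residual variance unchanged.

```python
def r_squared(var_residuals, var_outcome):
    """Unadjusted R^2: one minus the share of outcome variance left in the residuals."""
    return 1 - var_residuals / var_outcome


def adjusted_r_squared(var_residuals, var_outcome, n, p):
    """Adjusted R^2, which rescales the variance ratio by (n - 1) / (n - p - 1)."""
    return 1 - (var_residuals / var_outcome) * (n - 1) / (n - p - 1)


var_e = 23.34  # variance of the residuals (Exercise 8.11)
var_y = 83.06  # variance of the total price (Exercise 8.11)
n = 141        # number of auctions (Exercise 8.12)

r2 = r_squared(var_e, var_y)
r2_adj = adjusted_r_squared(var_e, var_y, n, p=4)

# Exercise 8.13 scenario: one more predictor, same residual variance.
r2_adj_extra = adjusted_r_squared(var_e, var_y, n, p=5)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {r2_adj:.3f}")
print(f"adjusted R^2 with a useless fifth predictor = {r2_adj_extra:.3f}")
```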

5 In multiple regression, the degrees of freedom associated with the variance of the estimate of the 
residuals is $n - p - 1$, not $n - 1$. For instance, if we were to make predictions for new data using our current 
model, we would find that the unadjusted $R^2$ is an overly optimistic estimate of the reduction in variance 
in the response, and using the degrees of freedom in the adjusted $R^2$ formula helps correct this bias. 

6 The unadjusted $R^2$ would stay the same and the adjusted $R^2$ would go down. 

8.2 Model selection 

The best model is not always the most complicated. Sometimes including variables that 
are not evidently important can actually reduce the accuracy of predictions. In this section 
we discuss model selection strategies, which will help us eliminate variables that are less 
important from the model. 

In this section, and in practice, the model that includes all available explanatory 
variables is often referred to as the full model. Our goal is to assess whether the full model 
is the best model. If it isn't, we want to identify a smaller model that is preferable. 

8.2.1 Identifying variables that may not be helpful in the model 

Table 8.4 provides a summary of the regression output for the full model. The last column 
of the table lists p-values that can be used to assess hypotheses of the following form: 

$H_0$: $\beta_i = 0$ when the other explanatory variables are included in the model. 
$H_A$: $\beta_i \neq 0$ when the other explanatory variables are included in the model. 





               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     36.2110       1.5140     23.92     0.0000
condNew          5.1306       1.0511      4.88     0.0000
stockPhoto       1.0803       1.0568      1.02     0.3085
duration        -0.0268       0.1904     -0.14     0.8882
wheels           7.2852       0.5547     13.13     0.0000
                                                 df = 136



Table 8.4: The fit for the full regression model. This table is identical to 
Table 8.3. 



Example 8.14 The coefficient of condNew has a t test statistic of $T = 4.88$ and a 
p-value for its corresponding hypotheses ($H_0: \beta_1 = 0$, $H_A: \beta_1 \neq 0$) of about zero. 
How can this be interpreted? 

If we keep all the other variables in the model and add no others, then there is strong 
evidence that a game's condition (new or used) has a real relationship with the total 
auction price. 

Example 8.15 Is there strong evidence that using a stock photo is related to the 
total auction price? 

The t test statistic for stockPhoto is T = 1.02 and the p-value is about 0.31. There 
is not strong evidence that using a stock photo in an auction is related to the total 
price of the auction. We might consider removing the stockPhoto variable from the 
model. 

Exercise 8.16 Identify the p-values for both the duration and wheels variables 
in the model. Is there strong evidence supporting the connection of these variables 
with the total price in the model? 
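
Each t value in Table 8.4 is the point estimate divided by its standard error, the same relationship used for a single-predictor regression. The short sketch below recomputes the column from the table's estimates and standard errors; the dictionary name `table_8_4` is just a convenient label for those numbers.

```python
# (estimate, standard error) pairs copied from Table 8.4.
table_8_4 = {
    "(Intercept)": (36.2110, 1.5140),
    "condNew": (5.1306, 1.0511),
    "stockPhoto": (1.0803, 1.0568),
    "duration": (-0.0268, 0.1904),
    "wheels": (7.2852, 0.5547),
}

# T = estimate / standard error for each row.
t_values = {name: est / se for name, (est, se) in table_8_4.items()}

for name, t in t_values.items():
    print(f"{name:12s} T = {t:6.2f}")
```

Turning each T into a p-value additionally requires the t distribution with 136 degrees of freedom, which is left to statistical software here.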

There is not statistically significant evidence that either stockPhoto or duration is 
meaningfully contributing to the model. If the coefficients of these variables are not zero, 
their association with the outcome variable is probably weak. Next we consider common 
strategies for pruning such variables from a model. 

TIP: Using adjusted $R^2$ instead of p-values for model selection 

The adjusted $R^2$ may be used as an alternative to p-values for model selection, 
where a higher adjusted $R^2$ represents a better model fit. For instance, we could 
compare two models using their adjusted $R^2$, and the model with the higher 
adjusted $R^2$ would be preferred. This approach tends to include more variables in the 
final model when compared to the p-value approach. 



8.2.2 Two model selection strategies 

Two common strategies for adding or removing variables in a multiple regression model 
are called backward-elimination and forward-selection. These techniques are often referred to 
as stepwise model selection strategies, because they add or delete one variable at a time 
as they "step" through the candidate predictors. We will discuss these strategies in the 
context of the p-value approach; however, the adjusted $R^2$ approach may be employed as 
an alternative. 

The backward-elimination strategy starts with the model that includes all potential 
predictor variables. One-by-one variables are eliminated from the model until only variables 
with statistically significant p-values remain. The strategy within each elimination step is 
to drop the variable with the largest p-value, refit the model, and reassess the inclusion of 
all variables. 

Example 8.17 Results corresponding to the full model for the marioKart data are 
shown in Table 8.4. How should we proceed under the backward-elimination strategy? 

There are two variables with coefficients that are not statistically different from zero: 
stockPhoto and duration. We first drop the duration variable since it has a larger 
corresponding p-value, then we refit the model. A regression summary for the new 
model is shown in Table 8.5. 

In the new model, there is not strong evidence that the coefficient for stockPhoto is 
different from zero (even though the p-value dropped a little) and the other p-values 
remain very small. So again we eliminate the variable with the largest non-significant 
p-value, stockPhoto, and refit the model. The updated regression summary is shown 
in Table 8.6. 

In the latest model, we see that the two remaining predictors have statistically signif- 
icant coefficients with p-values of about zero. Since there are no variables remaining 
that could be eliminated from the model, we stop. The final model includes only the 
condNew and wheels variables in predicting the total auction price: 

$$\hat{y} = b_0 + b_1 x_1 + b_4 x_4 = 36.78 + 5.58x_1 + 7.23x_4$$

As an alternative description of how we could have performed this model selection 
strategy using adjusted $R^2$, please see footnote 7. 

7 At each elimination step, we refit the model without each of the variables up for potential elimination 
(e.g. in the first step, we would fit four models, where each would be missing a different predictor). If 
one of these smaller models has a higher adjusted $R^2$ than our current model, we pick the smaller model 
with the largest adjusted $R^2$. Had we used the adjusted $R^2$ criterion, we would have kept the stockPhoto 
variable in this backward-elimination example. 

               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     36.0483       0.9745     36.99     0.0000
condNew          5.1763       0.9961      5.20     0.0000
stockPhoto       1.1177       1.0192      1.10     0.2747
wheels           7.2984       0.5448     13.40     0.0000
                                                 df = 137



Table 8.5: The output for the regression model where totalPr is the out- 
come and the duration variable has been eliminated from the model. 





               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     36.7849       0.7066     52.06     0.0000
condNew          5.5848       0.9245      6.04     0.0000
wheels           7.2328       0.5419     13.35     0.0000
                                                 df = 138



Table 8.6: The output for the regression model where totalPr is the out- 
come and the duration and stock photo variables have been eliminated from 
the model. 

Notice that the p-value for stockPhoto changed a little from the full model (0.309) to 
the model that did not include the duration variable (0.275). It is common for p-values 
of one variable to change, due to collinearity, after eliminating a different variable. This 
fluctuation emphasizes the importance of refitting a model after each variable elimination 
step. The p-values tend to change dramatically when the eliminated variable is highly 
correlated with another variable in the model. 
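
The backward-elimination walk-through in Example 8.17 can be sketched as a short loop. To keep the example self-contained and grounded in the text, the "refit" step below is a stand-in that looks up the p-values reported in Tables 8.4, 8.5, and 8.6 rather than fitting anything; the function names and the 0.05 cutoff are assumptions made for illustration.

```python
# p-values reported for each model in Tables 8.4, 8.5, and 8.6.
REPORTED_P_VALUES = {
    frozenset({"condNew", "stockPhoto", "duration", "wheels"}): {
        "condNew": 0.0000, "stockPhoto": 0.3085, "duration": 0.8882, "wheels": 0.0000,
    },
    frozenset({"condNew", "stockPhoto", "wheels"}): {
        "condNew": 0.0000, "stockPhoto": 0.2747, "wheels": 0.0000,
    },
    frozenset({"condNew", "wheels"}): {
        "condNew": 0.0000, "wheels": 0.0000,
    },
}


def reported_p_values(variables):
    """Stand-in for refitting the model: look up the p-values the tables report."""
    return REPORTED_P_VALUES[frozenset(variables)]


def backward_eliminate(variables, p_values_for, alpha=0.05):
    """Drop the variable with the largest p-value until all are below alpha."""
    current = list(variables)
    while current:
        p_values = p_values_for(current)  # "refit" with the current variables
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break  # every remaining variable is statistically significant
        current.remove(worst)
    return current


final_model = backward_eliminate(
    ["condNew", "stockPhoto", "duration", "wheels"], reported_p_values
)
print(final_model)
```

With real data, `reported_p_values` would be replaced by a function that refits the regression on the remaining variables at every step, which is exactly why the p-values must be recomputed after each elimination.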

The forward-selection strategy is the reverse of the backward-elimination technique. 
Instead of eliminating variables one-at-a-time, we add variables one-at-a-time until we 
cannot find any variables that present strong evidence of their importance in the model. 

Example 8.18 Construct a model for the marioKart data set using the forward- 
selection strategy. 

We start with the model that includes no variables. Then we fit each of the possible 
models with just one variable. That is, we fit the model including just the condNew 
predictor, then the model just including the stockPhoto variable, then a model with 
just duration, and a model with just wheels. Each of the four models (yes, we fit 
four models!) provides a p-value for the coefficient of the predictor variable. Out of 
these four variables, the wheels variable had the smallest p-value. Since its p-value 
is less than 0.05 (the p-value was smaller than 2e-16), we add the Wii wheels variable 
to the model. Once a variable is added in forward-selection, it will be included in all 
models considered and in the final model. 

Since we successfully found a first variable to add, we consider adding another. We 
fit three new models: (1) the model including just the condNew and wheels vari- 
ables (output in Table 8.6), (2) the model including just the stockPhoto and wheels 
variables, and (3) the model including only the duration and wheels variables. Of 
these models, the first had the lowest p-value for its new variable (the p-value cor- 
responding to condNew was 1.4e-08). Because this p-value is below 0.05, we add the 
condNew variable to the model. Now the final model is guaranteed to include both 
the condition and Wii wheels variables. 

We repeat the process a third time, fitting two new models: (1) the model including 
the stockPhoto, condNew, and wheels variables (output in Table 8.5) and (2) 
the model including the duration, condNew, and wheels variables. The p-value cor- 
responding to stockPhoto in the first model (0.275) was smaller than the p-value 
corresponding to duration in the second model (0.682). However, since this smaller 
p-value was not below 0.05, there was not strong evidence that it should be included 
in the model. Therefore, neither variable is added and we are finished. 

The final model is the same as that arrived at using the backward-elimination strategy: 
we include the condNew and wheels variables in the final model. See footnote 8 
for how we would have proceeded had we used the $R^2_{adj}$ criterion instead of examining 
p-values. 





Model selection strategies 

The backward-elimination strategy begins with the largest model and eliminates 
variables one-by-one until we are satisfied that all remaining variables are impor- 
tant to the model. The forward-selection strategy starts with no variables included 
in the model, then it adds in variables according to their importance until no other 
important variables are found. 







There is no guarantee that the backward-elimination and forward-selection strategies 
will arrive at the same final model, regardless of whether we are using the p-value or $R^2_{adj}$ 
criterion. If the backward-elimination and forward-selection strategies are both tried and 
they arrive at different models, one option is to choose between the models using the $R^2_{adj}$ 
criterion (other options exist but are beyond the scope of this book). 

It is generally acceptable to use just one strategy, usually backward-elimination, and 
report the final model after verifying the conditions for fitting a linear model are reasonable. 

8.3 Checking model assumptions using graphs 

Multiple regression methods using the model 

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

generally depend on the following four assumptions: 

1. the residuals of the model are nearly normal, 

2. the variability of the residuals is nearly constant, 

3. the residuals are independent, and 

4. each variable is linearly related to the outcome. 

Simple and effective plots can be used to check each of these assumptions. 

8 Rather than look for variables with the smallest p-value, we look for the model with the largest $R^2_{adj}$. 
Using the forward-selection strategy, we start with the model with no predictors. Next we look at each 
model with a single predictor. If one of these models has a larger $R^2_{adj}$ than the model with no variables, we 
use this new model. We repeat this procedure, adding one variable at a time, until we cannot find a model 
with a larger $R^2_{adj}$. If we had done the forward-selection strategy using $R^2_{adj}$, we would have arrived at 
the model including condNew, stockPhoto, and wheels, which is a slightly larger model than we arrived at 
using the p-value approach. 

Figure 8.7: A normal probability plot of the residuals is helpful in identifying 
observations that might be outliers. 

Normal probability plot. A normal probability plot of the residuals is shown in Fig- 
ure 8.7. While the plot exhibits some minor irregularities, there are no outliers that 
might be cause for concern. In a normal probability plot for residuals, we tend to 
be most worried about residuals that appear to be outliers, since these indicate long 
tails in the distribution of residuals. 

Absolute values of residuals against fitted values. A plot of the absolute value of 
the residuals against their corresponding fitted values ($\hat{y}_i$) is shown in Figure 8.8. 
This plot is helpful to check the condition that the variance of the residuals is ap- 
proximately constant. We don't see any obvious deviations from constant variance in 
this example. 

Residuals in order of their data collection. A plot of the residuals in the order their 
corresponding auctions were observed is shown in Figure 8.9. Such a plot is helpful in 
identifying any connection between cases that are close to one another, e.g. we could 
look for declining prices over time or if there was a time of the day when auctions 
tended to fetch a higher price. Here we see no structure that indicates a problem 9 . 

Residuals against each predictor variable. We consider a plot of the residuals against 
the condNew variable and the residuals against the wheels variable. These plots are 
shown in Figure 8.10. For the two-level condition variable, we are guaranteed not 
to see a trend, and instead we are verifying that the variability doesn't fluctuate 
across groups. In this example, when we consider the residuals against the wheels 
variable, we see structure. There appears to be curvature in the residuals, indicating 
the relationship is probably not linear. 



9 An especially rigorous check would use time series methods. For instance, we could check whether 
consecutive residuals are correlated. Doing so with these residuals yields no statistically significant corre- 
lations. 
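
The check described in footnote 9, whether consecutive residuals are correlated, can be sketched as a lag-1 correlation. The function name and both residual sequences below are hypothetical, chosen only to show what strong dependence looks like; they are not the marioKart residuals, and a full time series analysis would go further than this.

```python
def lag1_correlation(residuals):
    """Pearson correlation between each residual and the one that follows it."""
    x = residuals[:-1]  # residuals 1 .. n-1
    y = residuals[1:]   # residuals 2 .. n
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical residuals in collection order.
drifting = [-3, -2, -1, 0, 1, 2, 3]     # steadily rising: neighbors move together
alternating = [1, -1, 1, -1, 1, -1]     # flips sign every step

print(f"drifting:    {lag1_correlation(drifting):+.2f}")
print(f"alternating: {lag1_correlation(alternating):+.2f}")
```

A value near zero is what the independence assumption leads us to expect; values near +1 or -1, as in these contrived sequences, would signal structure over time.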

Figure 8.8: Comparing the absolute value of the residuals against the fitted 
values ($\hat{y}_i$) is helpful in identifying deviations from the constant variance 
assumption. 

Figure 8.9: Plotting residuals in the order that their corresponding observations 
were collected helps identify connections between successive observations. 
If it seems that consecutive observations tend to be close to each 
other, this indicates the independence assumption of the observations would 
fail. 

Figure 8.10: In the two-level variable for the game's condition, we check for 
differences in distribution shape or variability. For numerical predictors, 
we also check for trends or other structure. We see some slight bowing in 
the residuals against the wheels variable. 

It is necessary to summarize diagnostics for any model fit. If the diagnostics support 
the model assumptions, this would improve credibility in the findings. If the diagnostic 
assessment shows remaining underlying structure in the residuals, we may still report the 
model but must also note its shortcomings. In the case of the auction data, we report 
that there may be a nonlinear relationship between the total price and the number of 
wheels included for an auction. This information would be important to buyers and sellers; 
omitting this information could be a setback to the very people who the model might assist. 



"All models are wrong, but some are useful" -George E.P. Box 

The truth is that no model is perfect. However, even imperfect models can be 
useful. Reporting a flawed model can be reasonable so long as we are clear and 
report the model's shortcomings. 



Caution: Don't report results when assumptions are heavily violated 

While there is a little leeway in model assumptions, don't go too far. If model 
assumptions are grossly violated, consider a new model, even if it means learning 
more statistical methods or hiring someone who can help. 



TIP: Confidence intervals in multiple regression 

Confidence intervals for coefficients in multiple regression can be computed using 
the same formula as in the single predictor model: 

$$b_i \pm t^*_{df} SE_{b_i}$$

where $t^*_{df}$ is the appropriate t value corresponding to the confidence level and model 
degrees of freedom, $df = n - p - 1$. 
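
As an illustration of this formula, the sketch below builds an interval for the wheels coefficient using the estimate and standard error from Table 8.4. The cutoff `t_star = 1.98` is an assumption: it is an approximate value for 95% confidence with 136 degrees of freedom and should be replaced by the value from a t table or software for any real analysis.

```python
estimate = 7.2852    # point estimate for wheels (Table 8.4)
std_error = 0.5547   # standard error for wheels (Table 8.4)
t_star = 1.98        # ASSUMED approximate cutoff: 95% confidence, df = 136

# Interval: estimate plus or minus t* times the standard error.
margin = t_star * std_error
lower = estimate - margin
upper = estimate + margin

print(f"approximate 95% CI for the wheels coefficient: ({lower:.2f}, {upper:.2f})")
```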



8.4 ANOVA and regression with categorical variables 

Fitting and interpreting models using categorical variables as predictors is similar to what 
we have encountered in simple and multiple regression. However, there is a twist: a single 
categorical variable will have multiple corresponding parameter estimates. To be precise, if 
the variable has $C$ categories, then there will be $C - 1$ parameter estimates. Furthermore, 
it is not appropriate to use a Z or T score to determine the significance of the categorical 
variable as a predictor unless it only has $C = 2$ levels. 

In this section, we will learn a new method called analysis of variance (ANOVA) 
and a new test statistic called F. ANOVA is used to assess whether the mean of the 
outcome variable is different for different levels of a categorical variable: 

$H_0$: The mean outcome is the same across all categories. In statistical notation, $\mu_1 = \mu_2 = 
\cdots = \mu_k$, where $\mu_i$ represents the mean of the outcome for observations in category $i$. 

$H_A$: The mean of the outcome variable is different for some (or all) groups. 

These hypotheses are used to evaluate a model of the form 

$$y_{i,j} = \mu_i + \epsilon_j \tag{8.19}$$






where an observation $y_{i,j}$ belongs to group $i$ and has error $\epsilon_j$. Generally we make three 
assumptions in applying this model: 

• the errors are independent, 

• the errors are nearly normal, and 

• the errors have nearly constant variance. 

These conditions probably look familiar: they are the same conditions we used for multiple 
regression. When these three assumptions are reasonable, we may perform an ANOVA to 
determine whether the data provide strong evidence against the null hypothesis that all 
the $\mu_i$ are equal. 



TIP: Level, category, and group are synonyms 

We sometimes call the levels of a categorical variable its categories or its groups. 



Example 8.20 College departments commonly run multiple lectures of the same 
introductory course each semester because of high demand. Consider a statistics 
department that runs three lectures of an introductory statistics course. We might 
like to determine whether there are statistically significant differences in first exam 
scores in these three classes (A, B, and C). Describe how the model and hypotheses 
above could be used to determine whether there are any differences between the three 
classes. 

The hypotheses may be written in the following form: 

$H_0$: The average score is identical in all lectures. Any observed difference is due to 
chance. Notationally, we write $\mu_A = \mu_B = \mu_C$. 

$H_A$: The average score varies by class. We would reject the null hypothesis in favor 
of this hypothesis if there were larger differences among the class averages than 
what we might expect from chance alone. 

We could label students in the first class as $y_{A,1}$, $y_{A,2}$, $y_{A,3}$, and so on. Students in 
the second class would be labeled $y_{B,1}$, $y_{B,2}$, etc. And students in the third class: 
$y_{C,1}$, $y_{C,2}$, etc. Then we could estimate the true averages ($\mu_A$, $\mu_B$, and $\mu_C$) using the 
group averages: $\bar{y}_A$, $\bar{y}_B$, and $\bar{y}_C$. 

Strong evidence favoring the alternative hypothesis in ANOVA is described by un- 
usually large differences among the group means. We will soon learn that assessing the 
variability of the group means relative to the variability among individual observations 
within each group is key to ANOVA's success. 

Example 8.21 Examine Figure 8.11. Compare groups I, II, and III. Can you visually
determine if the differences in the group centers are due to chance or not? Now
compare groups IV, V, and VI. Do these differences appear to be due to chance?

Any real difference in the means of groups I, II, and III is difficult to discern, because 
the data within each group are very volatile relative to any differences in the average 
outcome. On the other hand, it appears there are differences in the centers of groups 
IV, V, and VI. For instance, group IV appears to have a lower mean than that of 
the other two groups. Investigating groups IV, V, and VI, we see the differences in 
the groups' centers are noticeable because those differences are large relative to the 
variability in the individual observations within each group.

Figure 8.11: Side-by-side dot plot for the outcomes for six groups.



8.4.1 Is batting performance related to player position in MLB? 

We would like to discern whether there are real differences between the batting performance 
of baseball players according to their position: outfielder (OF), infielder (IF), designated 
hitter (DH), and catcher (C). We will use a data set called mlbBat10, which includes batting
records of 327 Major League Baseball (MLB) players from the 2010 season. Six of the 327
cases represented in mlbBat10 are shown in Table 8.12, and descriptions for each variable
are provided in Table 8.13. The measure we will use for the player batting performance (the 
outcome variable) is on-base percentage (OBP). The on-base percentage roughly represents 
the fraction of the time a player successfully gets on base or hits a home run. 



       name       team   position   AB    H     HR   RBI   AVG     OBP
1      I Suzuki   SEA    OF         680   214   6    43    0.315   0.359
2      D Jeter    NYY    IF         663   179   10   67    0.270   0.340
3      M Young    TEX    IF         656   186   21   91    0.284   0.330
325    B Molina   SF     C          202   52    3    17    0.257   0.312
326    J Thole    NYM    C          202   56    3    17    0.277   0.357
327    C Heisey   CIN    OF         201   51    8    21    0.254   0.324

Table 8.12: Six cases from the mlbBat10 data matrix.



Exercise 8.22 The null hypothesis under consideration is the following:
$\mu_{OF} = \mu_{IF} = \mu_{DH} = \mu_C$. Write the null and corresponding alternative hypotheses in plain
language. Answers in the footnote 10 .



10 $H_0$: The average on-base percentage is equal across the four positions. $H_A$: The average on-base
percentage varies across some (or all) groups.






variable     description
name         Player name
team         The player's team, where the team names are abbreviated
position     The player's primary field position (OF, IF, DH, C)
AB           Number of opportunities at bat
H            Number of hits
HR           Number of home runs
RBI          Number of runs batted in
batAverage   Batting average, which is equal to H/AB

Table 8.13: Variables and their descriptions for the mlbBat10 data set.



Example 8.23 The player positions have been divided into four groups: outfield
(OF), infield (IF), designated hitter (DH), and catcher (C). What would be an
appropriate point estimate of the on-base percentage by outfielders, $\mu_{OF}$?

A good estimate of the on-base percentage by outfielders would be the sample average
of OBP for just those players whose position is outfield: $\bar{y}_{OF} = 0.334$.

Table 8.14 provides summary statistics for each group. A side-by-side box plot for
the on-base percentage is shown in Figure 8.15. Notice that the variability appears to be
approximately constant across groups; nearly constant variance across groups is an important
assumption that must be satisfied before we consider the ANOVA approach.





                            OF      IF      DH      C
Sample size ($n_i$)         120     154     14      39
Sample mean ($\bar{y}_i$)   0.334   0.332   0.348   0.323
Sample SD ($s_i$)           0.029   0.037   0.036   0.045

Table 8.14: Summary statistics of on-base percentage, split by player position.



Example 8.24 The largest difference between the sample means is between the
designated hitter and the catcher positions. Consider again the original hypotheses:

$H_0$: $\mu_{OF} = \mu_{IF} = \mu_{DH} = \mu_C$

$H_A$: The average on-base percentage varies across some (or all) groups.

Why might it be inappropriate to run the test by simply estimating whether the
difference of $\mu_{DH}$ and $\mu_C$ is statistically significant at a 0.05 significance level?

The primary issue here is that we are inspecting the data before picking the groups 
that will be compared. It is inappropriate to examine all data by eye (informal 
testing) and only afterwards decide which parts to formally test. This is called data 
snooping or data fishing. Naturally we would pick the groups with the large 
differences for the formal test, leading to an unintentional inflation in the Type 1 
Error rate. To understand this better, let's consider a slightly different problem. 

Figure 8.15: Side-by-side box plot of the on-base percentage for 327 players
across four groups.

Suppose we are to measure the aptitude for students in 20 classes in a large elementary
school at the beginning of the year. In this school, all students are randomly assigned
to classrooms, so any differences we observe between the classes at the start of the
year are completely due to chance. However, with so many groups, we will probably
observe a few groups that look rather different from each other. If we select only
these classes that look so different, we will probably make the wrong conclusion that
the assignment wasn't random. While we might only formally test differences for a
few pairs of classes, we informally evaluated the other classes by eye before choosing
the most extreme cases for a comparison.
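The inflation described in this example can be seen in a small simulation. The sketch below is only an illustration under simplifying assumptions that are not part of the example: a hypothetical class size of 25, scores drawn from one normal distribution with a standard deviation treated as known, and a two-sided z test at the 0.05 level. It compares a pair of classes chosen before looking at the data with the pair chosen because it looks most extreme.

```python
import math
import random
import statistics

random.seed(2010)  # fixed seed so the sketch is reproducible

N_CLASSES = 20     # number of classes, as in the example
CLASS_SIZE = 25    # hypothetical class size
SIGMA = 1.0        # hypothetical score SD, treated as known
N_SCHOOLS = 1000   # number of simulated schools
Z_CUTOFF = 1.96    # two-sided cutoff at the 0.05 significance level

# Standard error of the difference between two class means (sigma known).
SE_DIFF = SIGMA * math.sqrt(2.0 / CLASS_SIZE)


def simulate_school():
    """Return (planned, snooped) rejection decisions for one simulated school."""
    # Every class is drawn from the same distribution, so the null is true.
    class_means = [
        statistics.fmean([random.gauss(0.0, SIGMA) for _ in range(CLASS_SIZE)])
        for _ in range(N_CLASSES)
    ]
    # Planned comparison: classes 0 and 1, picked before seeing any data.
    planned = abs(class_means[0] - class_means[1]) / SE_DIFF > Z_CUTOFF
    # Snooped comparison: the two classes that look most different.
    snooped = (max(class_means) - min(class_means)) / SE_DIFF > Z_CUTOFF
    return planned, snooped


decisions = [simulate_school() for _ in range(N_SCHOOLS)]
planned_rate = sum(p for p, _ in decisions) / N_SCHOOLS
snooped_rate = sum(s for _, s in decisions) / N_SCHOOLS

print("Type 1 Error rate, planned pair:", planned_rate)
print("Type 1 Error rate, snooped pair:", snooped_rate)
```

Under these assumptions the planned comparison rejects at close to the nominal rate, while the snooped comparison rejects far more often even though every class comes from the same distribution.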



For additional reading on the ideas expressed in Example 8.24, we recommend reading 
about the prosecutor's fallacy 11 . 

In the next section we will learn how to use the F statistic and ANOVA to test whether 
differences in means could have happened just by chance. 



8.4.2 Analysis of variance (ANOVA) and the F test 

The method of analysis of variance focuses on answering one question: is the variability in
the sample means so large that it seems unlikely to be from chance alone? This question is
different from earlier testing procedures since we will simultaneously consider many groups,
and evaluate whether their sample means differ more than we would expect from natural
variation. We call this variability the mean square between groups (MSG), and it has
an associated degrees of freedom, $df_G = k - 1$ when there are $k$ groups. The MSG is sort
of a scaled variance formula for means. If the null hypothesis is true, any variation in the
sample means is due to chance and shouldn't be too large. Details of MSG calculations
are provided in the footnote 12 ; however, we typically use software for these computations.

11 See, for example, http://www.stat.columbia.edu/~cook/movabletype/archives/2007/05/the_prosecutors.html.

The mean square between the groups is, on its own, quite useless in a hypothesis
test. We need a benchmark value for how much variability should be expected among
the sample means if the null hypothesis is true. To this end, we compute the mean of
the squared errors, often abbreviated as the mean square error (MSE), which has an
associated degrees of freedom value $df_E = n - k$. It is helpful to think of MSE as a measure
of the variability of the residuals. Details of the computations of the MSE are provided in
the footnote 13 for the interested reader.

When the null hypothesis is true, any differences among the sample means are only
due to chance, and the MSG and MSE should be about equal. As a test statistic for
ANOVA, we examine the ratio of MSG and MSE:

$$F = \frac{MSG}{MSE}$$

The MSG represents a measure of the between-group variability, and MSE the variability 
within each of the groups. 

Exercise 8.26 For the baseball data, MSG = 0.00252 and MSE = 0.00127.
Identify the degrees of freedom associated with each mean square and verify the F 
statistic is 1.994. 



We use the F statistic to evaluate the hypotheses in what is called an F test. We
compute a p-value from the F statistic using an F distribution, which has two associated
parameters: $df_1$ and $df_2$. For the F statistic in ANOVA, $df_1 = df_G$ and $df_2 = df_E$. An
F distribution with 3 and 323 degrees of freedom, corresponding to the F statistic for the
baseball hypothesis test, is shown in Figure 8.16.

The larger the observed variability in the sample means (MSG) relative to the residu- 
als (MSE), the larger F will be and the stronger the evidence against the null hypothesis. 
Because larger values of F represent stronger evidence against the null hypothesis, we use 
the upper tail of the distribution to compute a p-value. 
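The formulas in footnotes 12 and 13 can be applied directly to group summaries. The Python sketch below does this for the rounded values printed in Table 8.14, computing the grand mean as the size-weighted average of the group means. The variable names are illustrative. Because the table entries are rounded to three decimals, the resulting F (roughly 1.9) only approximates the 1.994 that software reports from the full data.

```python
# Summary statistics from Table 8.14 (rounded values, as printed).
groups = {
    "OF": {"n": 120, "mean": 0.334, "sd": 0.029},
    "IF": {"n": 154, "mean": 0.332, "sd": 0.037},
    "DH": {"n": 14, "mean": 0.348, "sd": 0.036},
    "C": {"n": 39, "mean": 0.323, "sd": 0.045},
}

k = len(groups)                                  # number of groups
n_total = sum(g["n"] for g in groups.values())   # total sample size

# Grand mean: the size-weighted average of the group means.
grand_mean = sum(g["n"] * g["mean"] for g in groups.values()) / n_total

# Sum of squares between groups and sum of squared errors.
ssg = sum(g["n"] * (g["mean"] - grand_mean) ** 2 for g in groups.values())
sse = sum((g["n"] - 1) * g["sd"] ** 2 for g in groups.values())

df_g = k - 1          # degrees of freedom for MSG
df_e = n_total - k    # degrees of freedom for MSE

msg = ssg / df_g
mse = sse / df_e
f_stat = msg / mse

print("df_G =", df_g, " df_E =", df_e)
print("MSG =", round(msg, 5), " MSE =", round(mse, 5), " F =", round(f_stat, 2))
```

The gap between this value and the software output is a reminder that intermediate rounding can noticeably move a ratio of two small quantities.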

12 Let $\bar{y}$ represent the mean of outcomes across all groups. Then the mean square between groups is
computed as

$$MSG = \frac{1}{df_G} SSG = \frac{1}{k - 1} \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2$$

where SSG is called the sum of squares between groups and $n_i$ is the sample size of group $i$.

13 Let $\bar{y}$ represent the mean of outcomes across all groups. Then the sum of squares total (SST) is
computed as

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

where the sum is over all observations in the data set. Then we compute the sum of squared errors
(SSE) in one of three equivalent ways:

$$SSE = SST - SSG = (n_1 - 1) s_1^2 + (n_2 - 1) s_2^2 + \cdots + (n_k - 1) s_k^2$$

where $s_i^2$ is the sample variance (square of the standard deviation) of the residuals in group $i$, and the last
expression represents the sum of the squared residuals across all groups. Then the MSE is the standardized
form of SSE: $MSE = \frac{1}{df_E} SSE$.

The F statistic and the F test

Analysis of variance (ANOVA) is used to test whether the mean outcome differs
across 2 or more groups. ANOVA uses a test statistic F, which represents a
standardized ratio of variability in the sample means relative to the variability of
the residuals. If $H_0$ is true and the model assumptions are satisfied, the statistic
F follows an F distribution with parameters $df_1 = k - 1$ and $df_2 = n - k$. The
upper tail of the F distribution is used to represent the p-value.



Exercise 8.27 The test statistic for the baseball example is F = 1.994. Shade the
area corresponding to the p-value in Figure 8.16.

Example 8.28 The p-value corresponding to the solution for Exercise 8.27 is equal
to about 0.115. Does this provide strong evidence against the null hypothesis?

The p-value is larger than 0.05, indicating the evidence is not sufficiently strong 
to reject the null hypothesis at a significance level of 0.05. That is, the data do 
not provide strong evidence that the average on-base percentage varies by player's 
primary field position. 
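The upper-tail area behind this p-value can be approximated numerically. The sketch below is one possible approach, not the method any particular package uses: it integrates the F density with Simpson's rule using only the Python standard library, and the function names are illustrative. It assumes $df_1 > 2$ so that the density is finite at zero. In practice one would normally call a statistical library instead, for example `pf` in R or `scipy.stats.f.sf` in SciPy.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F distribution with d1 and d2 degrees of freedom."""
    if x <= 0.0:
        return 0.0  # valid at zero because this sketch assumes d1 > 2
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = ((d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log1p(d1 * x / d2)
               - log_beta)
    return math.exp(log_pdf)


def f_upper_tail(f_stat, d1, d2, steps=20000):
    """Approximate P(F > f_stat) as 1 minus a Simpson's-rule integral on [0, f_stat]."""
    h = f_stat / steps  # steps must be even for Simpson's rule
    total = f_pdf(0.0, d1, d2) + f_pdf(f_stat, d1, d2)
    for i in range(1, steps):
        weight = 4 if i % 2 == 1 else 2
        total += weight * f_pdf(i * h, d1, d2)
    return 1.0 - total * h / 3.0


# Baseball example: F = 1.994 with df1 = 3 and df2 = 323.
p_value = f_upper_tail(1.994, 3, 323)
print("approximate p-value:", round(p_value, 3))
```

The result should land close to the 0.115 quoted in Example 8.28; small differences in later digits reflect the numerical approximation and the rounding of F.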

8.4.3 Reading regression and ANOVA output from software 

The calculations required to perform an ANOVA by hand are tedious and prone to human
error. For these reasons it is common to use statistical software to calculate the F statistic
and p-value.

An ANOVA can be summarized in a table very similar to that of a regression summary. 
Table 8.17 shows an ANOVA summary to test whether the mean of on-base percentage 
varies by player positions in the MLB. 

Exercise 8.29 Earlier you verified that the F statistic for this analysis was 1.994,
and the p-value of 0.115 was provided. Circle these values in Table 8.17 and notice the
corresponding column name. Notice that both of these values are in the row labeled
position, which corresponds to the categorical variable representing the player position.

            Df    Sum Sq   Mean Sq   F value   Pr(>F)
position    3     0.0076   0.0025    1.9943    0.1147
Residuals   323   0.4080   0.0013
                           $s_{pooled} = 0.036$ on $df = 323$

Table 8.17: ANOVA summary for testing whether the average on-base percentage
differs across player positions.

Exercise 8.30 The $s_{pooled} = 0.036$ on $df = 323$ describes the estimated standard
deviation associated with the residuals. Verify that $s_{pooled}$ equals the square root of
the MSE for the Residuals row.
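A quick numerical check of this relationship is sketched below in Python, using the Residuals row of Table 8.17. The MSE is recomputed as Sum Sq divided by Df rather than taken from the rounded Mean Sq column; the variable names are illustrative.

```python
import math

# Residuals row of Table 8.17: sum of squared errors and its degrees of freedom.
sse = 0.4080
df_e = 323

mse = sse / df_e             # mean square error
s_pooled = math.sqrt(mse)    # pooled estimate of the residual standard deviation

print("MSE =", round(mse, 4), " s_pooled =", round(s_pooled, 3))
```

Rounded to three decimals, the result agrees with the 0.036 printed beneath the table.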

8.4.4 Graphical diagnostics for an ANOVA analysis 

There are three primary conditions we must check for an ANOVA analysis, all related to 
the residuals (errors) associated with the model. Recall that we assume the errors are 
independent, nearly normal, and have nearly constant variance across the groups. 

Independence. If observations are collected in a particular order, we should plot the resid- 
uals in the order the corresponding observations were collected (e.g. see Figure 8.9 on 
page 319). For the baseball data, the data were collected from a sorted table, making 
such a review impossible. However, we can consider the nature of the data: Do we 
have reason to believe players are not independent? There are not obvious reasons 
why independence should not hold, so we will assume independence is reasonable in 
lieu of being able to examine this condition using data. 

Approximately normal. The normality assumption for the residuals is especially impor- 
tant when the sample size is quite small. Figure 8.18 shows a normal probability plot 
for the residuals from the baseball data. We do see some deviation from normality 
at the low end, where there is a longer tail than what we would expect if the resid- 
uals were truly normal. While we should report this finding with the results of the 
hypothesis test, this slight deviation probably has little impact on the test results 
since there are so many players included in the sample and they are not spread thinly 
across many groups. 

Constant variance. The last assumption is that the variance associated with the residuals 
is nearly constant from one group to the next. This assumption can be checked by 
examining a side-by-side box plot of the outcomes, as in Figure 8.15. In this case, 
the variability is similar in the four groups but not identical. We see in Table 8.14 
on page 324 that the standard deviation varies a bit from one group to the next. 
Whether these differences are from natural variation is unclear, so we should report 
this uncertainty with the final results. 



Caution: Diagnostics for an ANOVA analysis 

Independence is always important to an ANOVA analysis. The normality condition 
is very important when the sample sizes for each group are relatively small. The 
constant variance condition is especially important when the sample sizes differ 
between groups.

8.4.5 Multiple comparisons and controlling Type 1 Error rate

When we reject the null hypothesis in an ANOVA analysis, we might wonder, which of 
these groups have different means? To answer this question, we compare the means of each 
possible pair of groups. For instance, if there are three groups and there is strong evidence 
that there are some differences in the group means, there are three comparisons to make: 
group 1 to group 2, group 1 to group 3, and group 2 to group 3. These comparisons can 
be accomplished using a two-sample t test, but we must use a modified significance level 
and a pooled estimate of the standard deviation across groups. 

Example 8.31 Example 8.20 on page 322 discussed three statistics lectures, all
taught during the same semester. Table 8.19 shows summary statistics for these 
three courses, and a side-by-side box plot of the data is shown in Figure 8.20. We 
would like to conduct an ANOVA for these data. Do you see any deviations from the 
three conditions for ANOVA? 

In this case (like many others) it is difficult to check independence in a rigorous way. 
Instead, the best we can do is use common sense to consider reasons the assumption 
of independence may not hold. For instance, the independence assumption may not 
be reasonable if there is a star teaching assistant that only half of the students may 
access; such a scenario would divide a class into two subgroups. After carefully 
considering the data, we believe that assuming independence may be acceptable. 

The distributions in the side-by-side box plot appear to be roughly symmetric and 
show no noticeable outliers. 

The box plots show approximately equal variability, which can be verified in
Table 8.19, supporting the constant variance assumption.

Class i       A      B      C
$n_i$         58     55     51
$\bar{y}_i$   75.1   72.0   78.9
$s_i$         13.9   13.8   13.1

Table 8.19: Summary statistics for the first midterm scores in three different
lectures of the same course.

Figure 8.20: Side-by-side box plot for the first midterm scores in three
different lectures of the same course.



Exercise 8.32 An ANOVA was conducted for the midterm data, and a summary
is shown in Table 8.21. What should we conclude?

            Df    Sum Sq     Mean Sq   F value   Pr(>F)
lecture     2     1290.11    645.06    3.48      0.0330
Residuals   161   29810.13   185.16
                             $s_{pooled} = 13.61$ on $df = 161$

Table 8.21: ANOVA summary table for the midterm data.



There is strong evidence that the differences among the means of the three classes are not
simply due to chance. We might wonder, which of the classes are actually different? As
discussed in earlier chapters, a two-sample t test could be used to test for differences in each
possible pair of groups. However, one pitfall was discussed in Example 8.24 on page 324:
when we run so many tests, the Type 1 Error rate increases. This issue is resolved by using
a modified significance level.

Multiple comparisons and the Bonferroni correction for $\alpha$

The scenario of testing many pairs of groups is called multiple comparisons.
The Bonferroni correction suggests that a more stringent significance level is
more appropriate for these tests:

$$\alpha^* = \alpha / K$$

where $K$ is the number of comparisons being considered (formally or informally).
If there are $k$ groups, then usually all possible pairs are compared and $K = \frac{k(k-1)}{2}$.



Example 8.33 In Exercise 8.32, you found that the data showed strong evidence of
differences in the average midterm grades between the three lectures. Complete the
three possible pairwise comparisons using the Bonferroni correction and report any
differences.

We use a modified significance level of $\alpha^* = 0.05/3 = 0.0167$. Additionally, we use
the pooled estimate of the standard deviation: $s_{pooled} = 13.61$ on $df = 161$.

Lecture A versus Lecture B: The estimated difference and standard error are,
respectively,

$$\bar{y}_A - \bar{y}_B = 75.1 - 72 = 3.1 \qquad SE = \sqrt{\frac{13.61^2}{58} + \frac{13.61^2}{55}} = 2.56$$

(See Section 6.2.4 for additional details.) This results in a T score of
1.21 on $df = 161$ (we use the $df$ associated with $s_{pooled}$) and a two-tailed p-value of
0.228. This p-value is larger than $\alpha^* = 0.0167$, so there is not strong evidence of a
difference in the means of lectures A and B.

Lecture A versus Lecture C: The estimated difference and standard error are 3.8 and
2.61, respectively. This results in a T score of 1.46 on $df = 161$ and a two-tailed
p-value of 0.1462. This p-value is larger than $\alpha^*$, so there is not strong evidence of a
difference in the means of lectures A and C.

Lecture B versus Lecture C: The estimated difference and standard error are 6.9 and
2.65, respectively. This results in a T score of 2.60 on $df = 161$ and a two-tailed
p-value of 0.0102. This p-value is smaller than $\alpha^*$. Here we find strong evidence of a
difference in the means of lectures B and C.
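The three comparisons can be reproduced with a short script. The Python sketch below uses the sample sizes and means from Table 8.19 and the pooled standard deviation from Table 8.21. The two-tailed p-values come from a Simpson's-rule integral of the t density, which is an illustrative stand-in for the t distribution function in statistical software, and the function names are hypothetical. Because the script keeps unrounded T scores, its p-values can differ slightly in later digits from those quoted above.

```python
import math
from itertools import combinations


def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - ((df + 1) / 2) * math.log1p(t * t / df))


def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value from a Simpson's-rule integral on [0, |t_stat|]."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 == 1 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3.0  # P(0 < T < |t_stat|)
    return 1.0 - 2.0 * central_area


# (sample size, sample mean) for each lecture, from Table 8.19.
lectures = {"A": (58, 75.1), "B": (55, 72.0), "C": (51, 78.9)}
s_pooled, df = 13.61, 161  # from Table 8.21
alpha = 0.05

pairs = list(combinations(lectures, 2))
K = len(pairs)            # equals k(k-1)/2 for k groups
alpha_star = alpha / K    # Bonferroni-corrected significance level

results = {}
for g1, g2 in pairs:
    (n1, mean1), (n2, mean2) = lectures[g1], lectures[g2]
    diff = abs(mean1 - mean2)
    se = s_pooled * math.sqrt(1 / n1 + 1 / n2)
    t_score = diff / se
    p = two_tailed_p(t_score, df)
    results[(g1, g2)] = (diff, se, t_score, p, p < alpha_star)
    print(g1, "vs", g2, round(diff, 1), round(se, 2),
          round(t_score, 2), round(p, 4), p < alpha_star)
```

Only the B versus C comparison falls below $\alpha^*$, matching the conclusion of the example.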

We might summarize the findings of the analysis from Example 8.33 using the following
notation:

$$\mu_A \stackrel{?}{=} \mu_B \qquad \mu_A \stackrel{?}{=} \mu_C \qquad \mu_B \neq \mu_C$$

The midterm mean in lecture A is not statistically distinguishable from those of lectures
B or C. However, there is strong evidence that lectures B and C are different. In the first
two pairwise comparisons, we did not have sufficient evidence to reject the null hypothesis.
Recall that failing to reject $H_0$ does not imply $H_0$ is true.

Caution: Sometimes an ANOVA will reject the null but no groups will
have statistically significant differences

It is possible to reject the null hypothesis using ANOVA and then to not subsequently
identify differences in the pairwise comparisons. However, this does not
invalidate the ANOVA conclusion. It only means we have not been able to successfully
identify which groups differ in their means.



The ANOVA procedure examines the big picture: it considers all groups simultane- 
ously to decipher whether there is evidence that some difference exists. Even if the test 
indicates that there is strong evidence of differences in group means, identifying with high 
confidence a specific difference as statistically significant is more difficult. 

Consider the following analogy: we observe a Wall Street firm that makes large quanti- 
ties of money based on predicting mergers. Mergers are generally difficult to predict, and if 
the prediction success rate is extremely high, that may be considered sufficiently strong ev- 
idence to warrant investigation by the Securities and Exchange Commission (SEC). While 
the SEC may be quite certain that there is insider trading taking place at the firm, the 
evidence against any single trader may not be very strong. It is only when the SEC consid- 
ers all the data that they identify the pattern. This is effectively the strategy of ANOVA: 
stand back and consider all the groups simultaneously. 

8.4.6 Using ANOVA for multiple regression 

The ANOVA methodology can be extended to multiple regression, where we simultaneously
incorporate categorical and numerical predictors into a model. The method discussed so
far, which compares an outcome across the levels of a single categorical variable, is called
one-way ANOVA. There are two extensions that we briefly discuss here: evaluating all
variables in a model simultaneously, and using ANOVA in model selection where some
variables are numerical and others categorical.

Some software will supply additional information about a multiple regression model fit 
beyond the regression summaries described in this textbook. This additional information 
can be used in an assessment of the utility of the full model. For instance, below is the full 
regression summary for the Mario Kart Wii game analysis from Section 8.2 (implemented 
with R statistical software 14 ) using all four predictors: 

Residuals:
     Min       1Q   Median       3Q      Max
-11.3788  -2.9854  -0.9654   2.6915  14.0346

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  36.21097    1.51401  23.917  < 2e-16 ***
condNew       5.13056    1.05112   4.881 2.91e-06 ***
stockPhoto    1.08031    1.05682   1.022    0.308
duration     -0.02681    0.19041  -0.141    0.888
wheels        7.28518    0.55469  13.134  < 2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.901 on 136 degrees of freedom
Multiple R-squared: 0.719, Adjusted R-squared: 0.7108
F-statistic: 87.01 on 4 and 136 DF, p-value: < 2.2e-16

14 R is free and can be downloaded at www.r-project.org.

The main output labeled Coefficients should be familiar as the multiple regression sum- 
mary. The last three lines are new and provide details about 

• the standard deviation associated with the residuals (4.901), 

• degrees of freedom (136), 

• $R^2$ (0.719) and adjusted $R^2$ (0.7108), and

• also an F statistic (87.01 with $df_1 = 4$ and $df_2 = 136$) with an associated p-value
(<2.2e-16, i.e. about zero).
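One way to get comfortable reading the Coefficients table is to confirm that each t value is the estimate divided by its standard error. The Python sketch below does this with the numbers copied from the Mario Kart output; the dictionary layout and names are illustrative.

```python
# (Estimate, Std. Error, reported t value) copied from the Coefficients table.
coefficients = {
    "(Intercept)": (36.21097, 1.51401, 23.917),
    "condNew": (5.13056, 1.05112, 4.881),
    "stockPhoto": (1.08031, 1.05682, 1.022),
    "duration": (-0.02681, 0.19041, -0.141),
    "wheels": (7.28518, 0.55469, 13.134),
}

# Recompute each t value as estimate / standard error.
t_values = {name: est / se for name, (est, se, _) in coefficients.items()}

for name, (_, _, reported) in coefficients.items():
    print(name, round(t_values[name], 3), "reported:", reported)
```

Each recomputed ratio agrees with the printed t value to within rounding of the displayed digits.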

The F statistic and p-value in the last line can be used for a test of the entire model. The
p-value can be used to answer the following question: Is there strong evidence that the
model as a whole is significantly better than using no variables? In this case, with a p-value
of less than $2.2 \times 10^{-16}$, there is extremely strong evidence that the variables included
are helpful in prediction. Notice that the p-value does not verify that all variables are
actually important in the model; it only considers the importance of all of the variables
simultaneously. This is similar to how ANOVA was earlier used to assess differences across
all means without saying anything about the difference between a particular pair of means.
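The overall F statistic is tied to $R^2$ through a standard identity that this section does not state: with $p$ predictors and $df_2$ residual degrees of freedom, $F = \frac{R^2 / p}{(1 - R^2) / df_2}$. The Python sketch below applies it to the values in the Mario Kart output. Since $R^2$ is printed to only three decimals, the result matches the 87.01 in the last line of the output only approximately.

```python
# Values read from the last lines of the regression output.
r_squared = 0.719     # Multiple R-squared (rounded in the printout)
n_predictors = 4      # df_1: number of predictors in the model
df_residual = 136     # df_2: residual degrees of freedom

# Standard identity linking the overall F statistic to R-squared.
explained_per_df = r_squared / n_predictors
unexplained_per_df = (1 - r_squared) / df_residual
f_stat = explained_per_df / unexplained_per_df

print("F =", round(f_stat, 1), "on", n_predictors, "and", df_residual, "DF")
```

Seen this way, the whole-model test asks whether the variability explained per predictor is large relative to the variability left unexplained per residual degree of freedom.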

The second setting for ANOVA in the general multiple regression framework is one 
that is more delicate: model selection. We could compare the variability in the residuals of 
two models that differ by just one predictor using ANOVA as a tool to evaluate whether 
the data support the inclusion of that variable in the model. We postpone further details 
of this method to a later course.

8.5 Exercises

8.5.1 Introduction to multiple regression 

8.1 In Chapter 6 you were introduced to a data set from an experiment to measure 
and compare the effectiveness of various feed supplements on the growth rate of chickens. 
Newly hatched chicks were randomly allocated into six groups, and each group was given 
a different feed supplement. Their weights in grams after six weeks are given along with 
feed types in the data set called chickwts. We are specifically interested in the effect of 
casein feed on the weights of these chicks, so we have created a variable called casein and 
coded chicks who were on casein feed as 1 and those who were on other diets as 0. The 
summary table below shows the results of a simple linear regression model for predicting 
weight from casein. [38] 





              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   248.64     9.54         26.06     0.0000
casein        74.94      23.21        3.23      0.0019



(a) Write the equation of the regression line. 

(b) Interpret the slope in context, and calculate the predicted weight of chicks who are
and who are not on casein feed.

(c) Is there a statistically significant relationship between feed type (casein or other) and 
the average weight of chicks? State the hypotheses and include any information used to 
conduct the test. Note that if we look back at Exercise 6.19 on page 268, we would see 
that the variability within the casein group and the variability across the other groups 
are about equal and the distributions symmetric. With these conditions satisfied, it is 
reasonable to proceed with the test. (Note also that we don't need to check linearity 
since the predictor has only two levels.) 

8.2 Vitamin C is believed to help promote dental health. One common way to get 
Vitamin C is by drinking orange juice. Another option is to take ascorbic acid tablets. 
An experiment was conducted to test if one source is more effective than the other. 60 
guinea pigs were randomly assigned to these two delivery methods for Vitamin C, 30 in 
each group. The length of teeth in millimeters are given along with delivery methods in the 
data set called ToothGrowth. We created a variable called OJ and coded guinea pigs who
were given orange juice as 1 and those who were given ascorbic acid as 0. The summary
table below shows the results of a simple linear regression model for predicting the average
tooth length, len, from OJ. [51]





              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   16.96      1.37         12.42     0.0000
OJ            3.70       1.93         1.92      0.0604



(a) Write the equation of the regression line. 

(b) Interpret the slope in context, and calculate the predicted tooth length for guinea pigs 
who were given orange juice and those who were given ascorbic acid. 

(c) Is there a statistically significant relationship between the average tooth length and 
delivery method of Vitamin C in guinea pigs? State the hypotheses and include any 
information used to conduct the test. Note that the variability within the orange juice 
and the ascorbic acid groups are about equal and the distributions symmetric. With 
these conditions satisfied, it is reasonable to proceed with the test.

8.3 The Child Health and Development Studies (CHDS) is a collection of studies, one
of which considers all pregnancies between 1960 and 1967 among women in the Kaiser 
Foundation Health Plan in the San Francisco East Bay area. A random sample of these 
data are given in a data set called babies. We consider the relationship between smoking 
and weight of the baby. The variable smoke is coded 1 if the mother is a smoker, and 0 
if not. The summary table below shows the results of a simple linear regression model for 
predicting the average birth weight of babies, measured in ounces (bwt), from smoke. [52] 





              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   123.05     0.65         189.60    0.0000
smoke         -8.94      1.03         -8.65     0.0000



The variability within the smokers and non-smokers are about equal and the distributions 
symmetric. With these conditions satisfied, it is reasonable to proceed with the test. (Note 
that we don't need to check linearity since the predictor has only two levels.) 

(a) Write the equation of the regression line. 

(b) Interpret the slope in context, and calculate the predicted birth weight of babies born 
to smoker and non-smoker mothers. 

(c) Is there a statistically significant relationship between the average birth weight and 
smoking? State the hypotheses and include any information used to conduct the test. 

8.4 Exercise 8.3 introduces a data set on birth weight of babies. Another variable we 
consider is parity, where 0 is first born, and 1 is otherwise. The summary table below 
shows the results of a simple linear regression model for predicting the average birth weight 
of babies, measured in ounces, from parity. 





              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   120.07     0.60         199.94    0.0000
parity        -1.93      1.19         -1.62     0.1052



(a) Write the equation of the regression line. 

(b) Interpret the slope in context, and calculate the predicted birth weight of first borns 
and others. 



(c) Is there a statistically significant relationship between the average birth weight and 
parity? State the hypotheses and include any information used to conduct the test.

8.5 The babies dataset used in Exercises 8.3 and 8.4 includes information on length of
pregnancy in days (gestation), mother's age in years (age), mother's height in inches 
(height), and mother's pregnancy weight in pounds (weight), in addition to the smoking 
and parity variables considered earlier. Below are three observations from this data set. 





      bwt  gestation  parity  age  height  weight  smoke
1     120        284       0   27      62     100      0
2     113        282       0   33      64     135      0
1236  117        297       0   38      65     129      0



The summary table below shows the results of a linear regression model for predicting the 
average birth weight of babies based on all of the variables included in the data set. 





             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    -80.41       14.35    -5.60    0.0000
gestation        0.44        0.03    15.26    0.0000
parity          -3.33        1.13    -2.95    0.0033
age             -0.01        0.09    -0.10    0.9170
height           1.15        0.21     5.63    0.0000
weight           0.05        0.03     1.99    0.0471
smoke           -8.40        0.95    -8.81    0.0000



(a) Write the equation of the regression line that includes all of the variables. 

(b) Interpret the slopes of gestation and age in context. 

(c) The coefficient for parity is different than in the simple linear model shown in Exer- 
cise 8.4. Why might there be a difference? 

(d) Calculate the residual for the first observation in the data set. 
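
As a minimal sketch of the arithmetic behind part (d), the snippet below plugs the first listed observation into the fitted equation, using the rounded coefficients from the summary table above. The names `INTERCEPT`, `SLOPES`, and `predict_bwt` are illustrative choices, not part of the data set, and the result inherits the rounding of the printed coefficients.

```python
# Rounded coefficients copied from the summary table in Exercise 8.5.
INTERCEPT = -80.41
SLOPES = {
    "gestation": 0.44,
    "parity": -3.33,
    "age": -0.01,
    "height": 1.15,
    "weight": 0.05,
    "smoke": -8.40,
}

# First observation listed in Exercise 8.5.
first_obs = {
    "bwt": 120, "gestation": 284, "parity": 0, "age": 27,
    "height": 62, "weight": 100, "smoke": 0,
}


def predict_bwt(obs):
    """Fitted birth weight in ounces: intercept plus slope * value per predictor."""
    return INTERCEPT + sum(slope * obs[name] for name, slope in SLOPES.items())


fitted = predict_bwt(first_obs)
residual = first_obs["bwt"] - fitted  # residual = observed - fitted

print(f"fitted = {fitted:.2f} oz, residual = {residual:.2f} oz")
```

A negative residual means the model over-predicts the birth weight for that observation.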

8.6 Researchers interested in the relationship between absenteeism from school and certain 
demographic characteristics of children collected data from 146 randomly sampled students 
in rural New South Wales in a particular school year. These data are given in a data set 
called quine. Below are three observations from this data set. 





     eth  sex  lrn  days
1      0    1    1     2
2      0    1    1    11
146    1    0    0    37



The summary table below shows the results of a linear regression model for predicting the 
average number of days absent based on ethnic background (eth: 0 - aboriginal, 1 - not 
aboriginal), sex (sex: 0 - female, 1 - male), and learner status (lrn: 0 - average learner, 1 
- slow learner). [53] 





             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     18.93        2.57     7.37    0.0000
eth             -9.11        2.60    -3.51    0.0000
sex              3.10        2.64     1.18    0.2411
lrn              2.15        2.65     0.81    0.4177

(a) Write the equation of the regression line. 

(b) Interpret each one of the slopes in context. 

(c) Calculate the residual for the first observation in the data set. 

8.7 The variance of the residuals for the model given in Exercise 8.5 is 249.28, and the 
variance of the birth weights of all babies in the data set is 332.57. Calculate the R^2 and 
the adjusted R^2. Note that there are 1236 observations in the data set. 

8.8 The variance of the residuals for the model given in Exercise 8.6 is 240.57, and the 
variance of the number of absent days for all students in the data set is 264.17. Calculate 
the R^2 and the adjusted R^2. Note that there are 146 observations in the data set. 
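
The two quantities asked for in Exercises 8.7 and 8.8 can be sketched as below. This assumes the usual definitions, R^2 = 1 - Var(residuals)/Var(outcome) and adjusted R^2 = 1 - [Var(residuals)/Var(outcome)] * (n - 1)/(n - k - 1), where k is the number of predictors; the function names are illustrative. The example call uses the numbers quoted in Exercise 8.7, with k = 6 because the model in Exercise 8.5 has six explanatory variables.

```python
def r_squared(var_residuals, var_outcome):
    """Share of the outcome's variance explained by the model."""
    return 1 - var_residuals / var_outcome


def adjusted_r_squared(var_residuals, var_outcome, n, k):
    """R^2 penalized for the number of predictors k, given n observations."""
    return 1 - (var_residuals / var_outcome) * (n - 1) / (n - k - 1)


# Numbers quoted in Exercise 8.7 (model from Exercise 8.5, six predictors).
r2 = r_squared(249.28, 332.57)
adj_r2 = adjusted_r_squared(249.28, 332.57, n=1236, k=6)

print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
```

The adjusted value is always a little smaller than R^2 when k is at least 1, and the gap shrinks as n grows relative to k.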

8.5.2 Model selection 

8.9 Exercise 8.5 presents summary output for a regression model for predicting the average 
birth weight of babies based on six explanatory variables. 

(a) Determine which variable(s) do not have a significant relationship with the outcome 
and should be candidates for removal from the model. If there is more than one such 
variable, indicate which one should be removed first. 

(b) The summary table below shows the results of the regression we refit after removing 
age from the model. Determine if any other variable(s) should be removed from the 
model. 





             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    -80.64       14.04    -5.74    0.0000
gestation        0.44        0.03    15.28    0.0000
parity          -3.29        1.06    -3.10    0.0020
height           1.15        0.20     5.64    0.0000
weight           0.05        0.03     2.00    0.0459
smoke           -8.38        0.95    -8.82    0.0000



8.10 Exercise 8.6 presents summary output for a regression model for predicting the 
average number of days absent based on three explanatory variables. 

(a) Determine which variable(s) do not have a significant relationship with the outcome 
and should be candidates for removal from the model. If there is more than one such 
variable, indicate which one should be removed first. 

(b) The summary table below shows the results of the regression we refit after removing 
learner status from the model. Determine if any other variable(s) should be removed 
from the model. 





             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     19.98        2.22     9.01    0.0000
eth             -9.06        2.60    -3.49    0.0006
sex              2.78        2.60     1.07    0.2878

8.11 Exercise 8.5 provides regression output for the full model (including all explanatory 
variables available in the data set) for predicting birth weight of babies. In this exercise, 
we consider a forward-selection algorithm and add variables to the model one-at-a-time. 
The table below shows the p-value and adjusted R^2 of each model where we include only 
the corresponding predictor. Based on this table, which variable should be added to the 
model first? 



variable  gestation     parity  age     height         weight       smoke
p-value   2.2 x 10^-16  0.1052  0.2375  2.97 x 10^-12  8.2 x 10^-8  2.2 x 10^-16
R^2_adj   0.1657        0.0013  0.0003  0.0386         0.0229       0.0569



8.12 Exercise 8.6 provides regression output for the full model (including all explanatory 
variables available in the data set) for predicting number of days absent from school. In 
this exercise, we consider a forward-selection algorithm and add variables to the model 
one-at-a-time. The table below shows the p-value and adjusted R^2 of each model where 
we include only the corresponding predictor. Based on this table, which variable should be 
added to the model first? 



variable  ethnicity  sex     learner status
p-value   0.0007     0.3142  0.5870
R^2_adj   0.0714     0.0001  0



8.5.3 Checking model assumptions using graphs 

8.13 Exercise 8.9 presents a regression model for predicting the average birth weight of 
babies based on length of gestation, parity, height, weight, and smoke. Determine if the 
model assumptions are met using the plots below. If not, describe how to proceed with the 
analysis. 

[Plots omitted; recoverable panel labels: Parity, Height of mother, Weight of mother, Smoke.]

8.14 The table below presents summary output for a regression model for predicting the 
average GPA based on IQ and gender, where 0 represents a female and 1 represents a male. 





             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     -4.70        1.56    -3.01    0.0035
iq               0.11        0.01     7.77    0.0000
gender           0.97        0.37     2.60    0.0111



The p-values in this table suggest a significant relationship between GPA and the predic- 
tors, IQ and gender. Using the plots given below, determine if this regression model is 
appropriate for these data. 



[Plots omitted; recoverable axis labels: Theoretical Quantiles, Fitted values, Order of collection.]






CHAPTER 8. MULTIPLE REGRESSION AND ANOVA 



8.5.4 ANOVA and regression with categorical variables 

8.15 In Exercise 8.1, we considered the effect of casein feed on chicks' weight. Instead of 
categorizing feed type as casein or other, we might also want to consider all feed types at 
once: casein, horsebean, linseed, meat meal, soybean, and sunflower. The ANOVA output 
below can be used to test for differences between the average weights of chicks on different 
diets. 





           Df     Sum Sq   Mean Sq  F value  Pr(>F)
feed        5  231129.16  46225.83    15.36  0.0000
Residuals  65  195556.02   3008.55







Conduct a hypothesis test to determine if these data provide strong evidence that the 
average weight of chicks varies across some (or all) groups. Refer to Exercise 6.19 on 
page 268 to assist in checking ANOVA conditions. 
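
For readers who want to see how the columns of the ANOVA output in Exercise 8.15 relate to one another, here is a small sketch that rebuilds the Mean Sq and F value entries from the Df and Sum Sq columns. The variable names are illustrative, and the p-value is deliberately not recomputed because that would need an F-distribution routine from outside the standard library.

```python
# Df and Sum Sq columns copied from the ANOVA table in Exercise 8.15.
df_feed, ss_feed = 5, 231129.16
df_resid, ss_resid = 65, 195556.02

# Each mean square is a sum of squares divided by its degrees of freedom.
ms_feed = ss_feed / df_feed
ms_resid = ss_resid / df_resid

# The F statistic compares between-group to within-group variability.
f_value = ms_feed / ms_resid

# Degrees of freedom also recover the design: k groups and n observations.
k_groups = df_feed + 1
n_obs = df_resid + k_groups

print(f"Mean Sq: {ms_feed:.2f} (feed), {ms_resid:.2f} (residuals)")
print(f"F = {f_value:.2f} on {df_feed} and {df_resid} df; k = {k_groups}, n = {n_obs}")
```

The recovered k = 6 agrees with the six feed types named in the exercise.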

8.16 A professor who teaches a large introductory statistics class with eight discussion 
sections would like to test if student performance differs by discussion section. Each discus- 
sion section has a different teaching assistant. The summary table below shows the average 
final exam score for each discussion section as well as the standard deviation of scores and 
the number of students in each section. 





      Sec 1  Sec 2  Sec 3  Sec 4  Sec 5  Sec 6  Sec 7  Sec 8
n_i      33     19     10     29     33     10     32     31
x̄_i   92.94  91.11  91.80  92.45  89.30  88.30  90.12  93.35
s_i    4.21   5.58   3.43   5.92   9.32   7.27   6.93   4.57



The ANOVA output below can be used to test for differences between the average scores 
from the different discussion sections. 





            Df   Sum Sq  Mean Sq  F value  Pr(>F)
section      7   525.01    75.00     1.87  0.0767
Residuals  189  7584.11    40.13







Conduct a hypothesis test to determine if these data provide strong evidence that the aver- 
age score varies across some (or all) groups. Check conditions and describe any assumptions 
you must make to conduct the test. 

Appendix B 

End of chapter exercise 
solutions 



1 Introduction to data 

1.1 (a) Control: 56%. Treatment: 70%. (b) 
There is a 14% difference between the pain 
reduction rates in the two groups. It ap- 
pears that patients in the treatment group 
are more likely to show improvement and, 
at a first glance, acupuncture appears to be 
an effective treatment for migraines. (c) It's 
hard to say. The difference is somewhat 
large, but the sample is somewhat small. 

1.3 (a) 143,196 eligible subjects who were 
born in Southern California between 1989 
and 1993. (b) The variables are measure- 
ments of CO, NO2, ozone, and particulate 
matter less than 10 µm (PM10) collected at 
air-quality-monitoring stations as well as the 
birth weights of the babies. All of these variables 
are continuous numerical variables. (c) 
Does air pollution exposure have an effect on 
preterm births? 

1.5 (a) 202 black and 504 white adults who 
resided in or near New York City, were 
ages 20-94 years, and had BMIs of 18-35 
kg/m^2. (b) Age (numerical, continuous), sex 
(categorical), ethnicity (categorical), weight, 
height, waist and hip circumference, length 
of tibia, body density and volume, total 
body water (numerical, continuous). (c) 
How useful is BMI for predicting body fat- 
ness across age, sex and ethnic groups? 



1.7 (a) A participant in the survey. (b) 
1,691 participants. (c) gender (gender of 
the participant), age (age of the participant, 
in years), marital (marital status of 
the participant), grossIncome (gross income 
of the participant, in £), smoke (whether or 
not the participant smokes), amtWeekends 
(number of cigarettes smoked on weekends, 
# of cigarettes / day), amtWeekdays (number 
of cigarettes smoked on a week day, 
# of cigarettes / day). 

1.9 gender (categorical), age (originally 
numerical, continuous, though it was recorded 
as a discrete numerical variable), maritalStatus 
(categorical), grossIncome (originally 
numerical, continuous, but recorded as 
categorical), smoke (categorical), amtWeekends 
(numerical, discrete), amtWeekdays 
(numerical, discrete). 

1.11 We would expect productivity to in- 
crease as stress increases, but up to a point, 
after that productivity would decrease as 
stress continued to increase. The exact 
shape of your plot may be a little different. 




[Plot omitted; axis label: stress.]

1.13 (a) Population mean, µ_x = 5.5; sample 
mean, x̄ = 6.25. (b) Population mean, 
µ_x = 52; sample mean, x̄ = 58. 



1.15 (a) Decrease. (b) 73.6. (c) The new 
score, x_25, is more than 1 standard deviation 
away from the previous mean, and this 
will tend to increase the standard deviation 
of the data. While possible, it is mathe- 
matically rather tedious to calculate the new 
standard deviation. 



1.17 The distributions of the number of 
cigarettes smoked on weekends and on weekdays 
are both right skewed. The median 
of both distributions is between 10 and 15 
cigarettes, the first quartile is between 5 
and 10 cigarettes, and the third quartile is 
between 15 and 20 cigarettes. Hence the 
IQR of both distributions is roughly about 
10 cigarettes. There are potential outliers 
above 40 cigarettes per day, giving both dis- 
tributions a long right tail. We can also 
see that there are more respondents who 
smoke only a few cigarettes (0 to 5) on the 
weekdays, about 80 people, than on weekends, 
about 60 people. Another feature 
visible in the histograms is the peaks at 
10 and 20 cigarettes. This may be because 
most people do not keep track of exactly 
how many cigarettes they smoke, but round 
their answers to half a pack (10 cigarettes) 
or a whole pack (20 cigarettes). Due to these 
peaks, the distributions could be classified 
as bimodal. 



1.19 s_amtWeekends = 0, s_amtWeekdays = 
4.18. Variability of the amount of cigarettes 
smoked is higher on weekdays than on the 
weekends for this sample. 



1.21 (a) 6 (b) 6.5 



1.23 Plot below. 



1.25 (a) The distribution is unimodal and 
symmetric with a mean around 60 and a 
standard deviation of roughly 3; matches the 
box plot (2). (b) The distribution is uniform 
and values range from 0 to 100; matches box 
plot (3) which shows a symmetric distribution 
in this range. Also, each 25% chunk of 
the box plot has about the same width and 
there are no suspected outliers. (c) The 
distribution is unimodal and right skewed with 
a median between 1 and 2. The IQR of the 
distribution is roughly 1; matches box plot 
(1). 

1.27 (a) Since the median is defined as the 50th 
percentile and about 50% of the data is in 
the first bar, we would expect the median to be 
between 0 and 20. Q1 is also between 0 and 
20 as the 25th percentile is in the first bar as 
well. Q3, defined as the 75th percentile, is 
located between 40 and 60. (b) The distri- 
bution is right-skewed, so the long tail will 
pull the mean above the median. 

1.29 It appears that marathon times de- 
creased greatly between 1970-1975 and re- 
mained somewhat steady thereafter. Males 
consistently had shorter marathon times 
than females throughout the years. From 
the box plots of males and females, we could 
tell that males ran faster "on average"; however, 
we could not tell that the winning male 
time for each year was better than the win- 
ning female time. We also could not tell from 
the histogram or the box plot that marathon 
times have been decreasing for males and females 
throughout the years. 

1.31 (a) The distribution is right skewed 
with potential outliers on the positive end, 
therefore the median and the IQR are 
appropriate measures of center and spread. (b) 
The distribution is somewhat symmetric and 
probably does not have outliers, therefore 
the mean and the standard deviation are 
appropriate measures of center and spread. 
(c) The distribution would be right skewed. 
There would be some students who do not 
consume any alcohol but this is the minimum 
(there cannot be students who consume 
fewer than 0 drinks). There would 
be a few students who consume many more 
drinks than their peers, giving the distribution 
a long right tail. Due to the skew, the 
median and the IQR would be appropriate 
measures of center and spread. (d) The 
distribution would be right skewed. Most 
employees would make something on the order 
of the median salary, but we would expect 
to have some high level executives making 
a lot more. The distribution would have a 
long right tail, and the median and the IQR 
would be more appropriate measures 
of center and spread. 

1.33 (a) As well as the order of the cate- 
gories, we can also see the relative frequen- 
cies in the bar plot. These proportions are 
not readily available in the pie chart. (b) 
None. (c) Bar plot, so that we can also see 
the relative frequencies of the categories in 
this graph. 

1.35 (a) Proportion of patients who are 
alive at the end of the study is higher in 
the treatment group than in the control 
group. Therefore survival is not indepen- 
dent of whether or not the patient got a 
transplant. (b) The shape of the distribution 
of survival times in both groups is right 
skewed with outliers on the high end. The 
median survival time for the control group 
is much lower than the median survival time 
for the treatment group; patients who got a 
transplant typically lived longer. The maxi- 
mum survival time for the treatment group is 
much higher (about 5 years) than the max- 
imum survival time for the control group. 
Even though the maximum survival time for 



the control group is about 4 years, this ob- 
servation is an outlier. Overall, very few pa- 
tients without transplants made it beyond a 
year while nearly half of the transplant pa- 
tients survived at least one year. It should 
also be noted that while the first and third 
quartiles of the treatment group are higher 
than those for the control group, the IQR for 
the treatment group is much bigger, indicat- 
ing that there is more variability in survival 
times in the treatment group. 

1.37 (a) The population is all adults 20 and 
older living in the greater New York City 
area. The sample is the 202 black and 504 
white men and women who resided in or 
near New York City and had BMIs of 18-35 
kg/m^2. (b) The population is all Californians 
registered to vote in the 2010 midterm 
elections. The sample is the 1000 registered 
California voters who were surveyed for this 
study. 

1.39 (a) This is an observational study. (b) 
Wealth is one lurking variable. Countries 
with individuals who can widely afford in- 
ternet probably also can afford basic medical 
care. (Note: Answers may vary.) 

1.41 (a) Simple random sample. Non-response 
bias: if only those people who have 
strong opinions about the survey respond, 
his sample may not be representative of the 
population. (b) Convenience sample. Undercoverage 
bias: his sample may not be representative 
of the population since it consists 
only of his friends. 

1.43 (a) Non-responders are most likely 
parents who have busier schedules and have 
difficulty spending time with their kids after 
school. (b) The women who are not reached 
3 years later are most likely renters (as op- 
posed to homeowners) who may be in a lower 
socio-economic status. (c) There is no control 
group and there may be lurking variables. 
For example, it may be that these 
people who go running are generally health- 
ier and/or do other exercises. 

1.45 No, this was an observational study, 
and we cannot make such a causal statement 
based on an observational study. 

1.47 Prepare two cups for each participant, 
one containing regular Coke and the other 
containing Diet Coke. Make sure the cups 
are identical and contain equal amounts of 
soda. Label the cups A (regular) and B 
(diet). (Be sure to randomize A and B for 
each trial!) Give each participant the two 
cups, one cup at a time, in random order, 
and ask the participant to record a value 
that indicates how much she liked the bev- 
erage. Be sure that neither the participant 
nor the person handing out the cups knows 
the identity of the beverage to make this 
a double-blind experiment. (Answers may 
vary). 

1.49 (a) Experiment. (b) Treatment: exercise 
twice a week, control: no exercise. (c) 
Yes, the blocking variable is age. (d) No. (e) 
Since this is an experiment, we can make a 
causal statement. Since the sample is ran- 
dom, the causal statement can be general- 
ized to the population at large. However, we 
should be cautious about making a causal 
statement because of a possible placebo ef- 
fect. Note that this study could not actu- 
ally be conducted since people cannot be re- 
quired to participate in a clinical trial. 

1.51 (a) False. Instead of comparing 
counts, we should compare percentages of 
people in each group who suffered a heart 
attack. (b) True. (c) False. Association 
does not imply causation. We cannot infer 
a causal relationship based on an observa- 
tional study. (We cannot say changing the 
drug a person is on affects her risk, which is 
why part (b) is true.) (d) True. 



2 Probability 

2.1 False. The tosses are independent tri- 
als. 

2.3 (a) 10 tosses. With a low number of 
flips the variability in the number of heads 
observed is much larger, so a result further 
from 50% is more likely. (b) 100 tosses. 
With more flips, the observed proportion of 
heads would probably be closer to 50% and 
therefore above 40%. (c) 100 tosses. The 



more flips, the less variability away from 
50%. (d) 10 tosses. Fewer flips mean more 
volatility and a greater chance of getting far 
from 50% and below 30%. 

2.5 (a) 1/1024. (b) 1/1024. (c) 1023/1024. 

2.7 (a) Figure below. (b) 5% (c) 70% (d) 
95% (e) 5% (f) No, there are bloggers who 
own both types of cameras. 





[Venn diagram omitted; recoverable labels: DSLR, P&S; region values: 0.05, 0.20, 0.70, 0.05.]











2.9 (a) Not mutually exclusive. If the class 
is not graded on a curve, then independent. 
If graded on a curve, dependent. (b) Not 
mutually exclusive, most likely dependent. 
(c) No. See the answer to (a) when the 
course is not graded on a curve. 

2.11 (a) 0.26. (b) 0.23. (c) Assuming that 
the education level of the husband and wife 
are independent, 0.0598. (d) Independence, 
which may not be a reasonable assumption 
since people often marry others with a com- 
parable level of education. 

2.13 (a) Sum greater than 1. (b) OK 
mathematically. (c) Sum less than 1. (d) Negative 
probabilities make no sense. (e) OK. (f) 
Probabilities cannot be less than 0 or greater 
than 1. 

2.15 Approximate answers are OK. Answers 
are only estimates based on the sample. 
(a) 0.42. (b) 0.15. (c) 0.37. (d) 0.06. 

2.17 (a) The distribution is right skewed, 
with a median between $35,000 and $49,999. 
The IQR of the distribution is about 
$27,500. There are probably outliers on the 
high end due to the nature of the data. (b) 
62.2%. (c) Assuming gender and income are 
independent: 25.5%. (d) P(less than $50,000 
and female) = 29.4%. The independence as- 
sumption does not appear to be valid. If gen- 
der and income were independent, we would 
expect 25.5% of the sample to be female 
and make less than $50,000, but actually a 
higher proportion fall into this category. 

2.19 No, P(DSLR | point&shoot) = 0.22, 
which is not equal to P(DSLR). 

2.21 (a) 0.2825. (b) 0.1905. (c) 0.4167. (d) 
No, because P(black hair | brown eyes) ≠ 
P(black hair | blue eyes). (Other explanations 
are possible.) 

2.23 (a) 0.65. (b) 0.72. (c) Under the 
assumption of independence of gender and 
hamburger preference: 0.468. While it is 
possible there is some mysterious connection 
between burger choice and finding a part- 
ner, independence is probably a reasonable 
assumption. (d) 0.514. 

2.25 Female, most cats smaller than 2.5kg 
are female. 

2.27 0.6049. 

2.29 (a) Tree diagram below. (b) 0.68. (c) 
0.32. (d) Your test results have come in. 
While the test came back positive, this is 
not conclusive. A positive test result can 
occur even when a patient has no disease; 
occasionally a test will be wrong. For this 
reason, we will need to run some additional 
tests. 

[Tree diagram omitted; branches: Cancer?, then Result. Branch products:]

0.005*0.95 = 0.0048 
0.005*0.05 = 0.0002 
0.995*0.01 = 0.01 
0.995*0.99 = 0.985 
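
The probabilities in parts (b) and (c) of 2.29 follow from the four branch products by conditioning on a positive result. The sketch below is one way to lay that out. It assumes the reading of the tree in which 0.005 is the chance of having cancer, 0.95 the chance of a positive result given cancer, and 0.01 the chance of a positive result given no cancer; the branch labels themselves did not survive in the diagram, so treat them as an assumption.

```python
# Assumed reading of the tree's branch probabilities (see lead-in).
p_cancer = 0.005
p_pos_given_cancer = 0.95
p_pos_given_healthy = 0.01

# Joint probabilities: multiply along each branch of the tree.
joint = {
    ("cancer", "positive"): p_cancer * p_pos_given_cancer,
    ("cancer", "negative"): p_cancer * (1 - p_pos_given_cancer),
    ("healthy", "positive"): (1 - p_cancer) * p_pos_given_healthy,
    ("healthy", "negative"): (1 - p_cancer) * (1 - p_pos_given_healthy),
}

# Condition on a positive result: keep only the "positive" leaves.
p_positive = joint[("cancer", "positive")] + joint[("healthy", "positive")]
p_cancer_given_pos = joint[("cancer", "positive")] / p_positive
p_healthy_given_pos = joint[("healthy", "positive")] / p_positive

print(f"P(cancer | positive)    = {p_cancer_given_pos:.2f}")
print(f"P(no cancer | positive) = {p_healthy_given_pos:.2f}")
```

Because the condition is rare, most positive results come from the large healthy group, which is why a positive test alone is not conclusive.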



2.31 (a) 0.3. (b) 0.3. (c) 0.3. (d) 0.09. (e) 
Yes, each draw is from the same set of mar- 
bles. 

2.33 (a) 0.0909. (b) 0.3182. (c) 0.4545. (d) 
0. (e) 0.2879. 

2.35 0.0519. 

2.37 (a) 13. (b) No, this would be unreli- 
able. The students are not a random sample. 

2.39 (a) Table below. Expected winnings: 
$3.59. SD: $9.64. (b) EV: -$1.41, SD: $9.64. 
(c) No. The expected net profit is negative, 
so on average you expect to lose money. 

Event                3 hearts  3 blacks  Else
X                    $50       $25       $0
P(X)                 0.0129    0.1176    0.8695
X * P(X)             0.65      2.94      0
(X - E(X))^2 * P(X)  27.7852   53.9064   11.2062



2.41 (a) EV: -$0.16, SD: $2.99. (b) EV: -$0.16, 
SD: $1.73. (c) Expected values are 
the same but the standard deviations are 
different. The standard deviation from the 
game where winnings and losses are tripled 
is higher, making this game riskier. 

2.43 (a) Table below. Expected winnings: 
-$0.54. (b) No, he is expected to lose 
money on average. 



Event  2,...,9  J, Q, K  Ace     A*
X      -2       1        3       23
P(X)   0.6923   0.2308   0.0577  0.0192
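
The expected winnings quoted in 2.43 come from weighting each outcome by its probability. A minimal sketch, using only the X and P(X) rows of the table above (the variable names are illustrative):

```python
from math import sqrt

# X and P(X) rows of the table in solution 2.43.
outcomes = [-2, 1, 3, 23]
probabilities = [0.6923, 0.2308, 0.0577, 0.0192]

# Expected value: sum of outcome * probability.
expected = sum(x * p for x, p in zip(outcomes, probabilities))

# Variance: probability-weighted squared deviation of each outcome X
# (not of X * P(X)) from the expected value.
variance = sum((x - expected) ** 2 * p for x, p in zip(outcomes, probabilities))
std_dev = sqrt(variance)

print(f"E(X) = {expected:.2f}, SD(X) = {std_dev:.2f}")
```

The same three lines of arithmetic apply to any discrete probability model once the outcomes and their probabilities are listed.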



2.45 $4.26. 

2.47 (a) Mean: $3.90, SD: $0.34. (b) Mean: 
$27.30, SD: $0.89. 



3 Distributions of random variables 



3.1 Plots below. (a) 0.0885. (b) 0.0694. (c) 
0.5886. (d) 0.0456. 




3.3 (a) Verbal: N(µ = 462, σ = 119), 
Quant: N(µ = 584, σ = 151). (b) Z_VR = 
1.33, Z_QR = 0.57. Plots below. (c) She 
scored 1.33 standard deviations above the 
mean on the Verbal Reasoning section and 
0.57 standard deviations above the mean 
on the Quantitative Reasoning section. (d) 
Perc_VR = 91%, Perc_QR = 72%. (e) Verbal 
Reasoning. (f) VR: 9%, QR: 28%. (g) We 
cannot compare the raw scores since they are 
on different scales. Her scores will be mea- 
sured relative to the merits of other students 
on each exam, so it is helpful to consider the 
Z score. Comparing her percentiles is more 
appropriate for determining how well she did 
compared to others. 
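
The step from the Z scores in part (b) to the percentiles in parts (d) and (f) can be sketched with the standard normal CDF. The snippet below uses only the two Z scores quoted above and Python's standard library; `normal_cdf` is an illustrative helper, and a normal probability table would give the same values up to rounding.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


# Z scores quoted in part (b) of solution 3.3.
z_scores = {"Verbal Reasoning": 1.33, "Quantitative Reasoning": 0.57}

for section, z in z_scores.items():
    below = normal_cdf(z)   # percentile: share of test takers scoring lower
    above = 1 - below       # share of test takers scoring higher
    print(f"{section}: percentile = {below:.0%}, scored higher = {above:.0%}")
```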




[Plots omitted; marked values: Z = 0.57 and Z = 1.33.]



3.5 Answers to (b) and (c) would not 
change, though we would not draw a Nor- 
mal curve on which to show these scores. We 
could not answer parts (d) and (e) since the 
normal probability table is only valid for the 
normal model. 



3.7 (a) 711. (b) 400. 

3.9 Figures below. (a) 0.1210. (b) 0.1558. 
(c) 62.68 inches. (d) 43.25% 




3.11 (a) 0.1401. (b) 70.6°F or colder. 

3.13 (a) 0.67. Using 0.68 is also okay, but 
your answers for part (c) will differ a little 
from the listed solution. (b) x = $1800, 
µ = $1650. (c) σ = $223.88. 

3.15 (a) 0.2327. Figure below. (b) If you 
are bidding on only one auction and set a 
maximum bid price that is too low, chances 
are someone will outbid you and you won't 
win the auction. If your maximum bid price 
is too high, you may win the auction but 
you may be paying more than is necessary. 
If you are bidding on more than one auc- 
tion and your maximum bid price is too low, 
chances are you won't win any of the auc- 
tions. However, if your maximum bid price 
is too high, you may win more than one auc- 
tion and end up with multiple copies of the 
book. (c) An answer roughly equal to the 
10th percentile would be reasonable. Regrettably, 
no percentile cutoff point guarantees 



beyond any possible event that you win at 
least one auction. However, you may pick 
a higher percentile if you want to be more 
sure of winning an auction. (d) Using the 
10th percentile: $69.80. Answers may vary 
but should correspond to the answer given 
in part (c). 



[Plot omitted; shaded area 0.2327; axis values 89 and 100.]

3.17 70% of the data are within 1 SD, 95% 
are within 2 SD, and 100% are within 3 SD 
of the mean. The data approximately follow 
the 68-95-99.7% Rule. 

3.19 The distribution is unimodal, sym- 
metric, and approximately follows the 68- 
95-99.7% Rule. The superimposed normal 
curve seems to approximate the distribution 
pretty well. The points on the Normal prob- 
ability plot also seem to follow a straight 
line. There is one possible outlier on the 
lower end that is apparent in both graphs, 
but it is not too extreme. We can say that 
the distribution is nearly normal. 

3.21 No, in poker cards are dealt without 
replacement, and they have more than two 
categories. 

3.23 Approximate answers are OK. (a) 
0.13. (b) 0.12. (c) µ = 2.04, σ = 1.46. (d) 
µ = 3.33, σ = 2.79. (e) When p was smaller, 
i.e. the event was rarer, the expected number 
of trials before a success and the standard 
deviation increased. 
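
The pattern described in part (e) of 3.23 can be checked with the geometric distribution's mean 1/p and standard deviation sqrt(1 - p)/p. In the sketch below, the success probabilities 0.49 and 0.30 are assumptions: the exercise statement is not reproduced in this appendix, and these values were chosen only because they reproduce the means reported in parts (c) and (d).

```python
from math import sqrt


def geometric_mean_sd(p):
    """Mean and SD of the number of trials up to and including the first success."""
    return 1 / p, sqrt(1 - p) / p


# Assumed success probabilities (see lead-in); a smaller p means a rarer event.
for p in (0.49, 0.30):
    mu, sigma = geometric_mean_sd(p)
    print(f"p = {p:.2f}: mean = {mu:.2f}, SD = {sigma:.2f}")
```

Both the mean and the standard deviation grow as p shrinks, which is the relationship the solution points out.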

3.25 (a) 0.096. (b) µ = 8, σ = 7.48. 

3.27 (a) Yes, it meets the four required conditions. 
(b) 0.203. (c) 0.203. (d) 0.167. (e) 
0.997. 

3.29 (a) µ = 34.85, σ = 3.25. (b) Yes, since 
45 is more than 3 standard deviations from 
the mean. (c) 0.0015 (an answer of 0.0009 
would be okay if using a normal approxima- 
tion, since the conditions for the approxi- 
mation are satisfied). In part (b), we had 
determined that it would be unusual to observe 
45 or more 18-20 year olds who have 
consumed alcoholic beverages among a random 
sample of 50, and we calculated a 
very low probability for this event. 
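
The mean, standard deviation, and "more than 3 standard deviations" check in 3.29 can be reproduced with the binomial formulas µ = np and σ = sqrt(np(1 - p)). The sample size n = 50 is stated above; the success probability p = 0.697 is an assumption, back-solved from µ/n = 34.85/50 because the exercise statement is not part of this appendix. The normal tail below is the plain approximation without a continuity correction.

```python
from math import erf, sqrt

n = 50       # sample size stated in the solution
p = 0.697    # assumed: back-solved from the reported mean, 34.85 / 50

mu = n * p
sigma = sqrt(n * p * (1 - p))

# How many standard deviations above the mean is an observation of 45?
z = (45 - mu) / sigma

# Normal approximation to P(X >= 45), without continuity correction.
upper_tail = 1 - 0.5 * (1 + erf(z / sqrt(2)))

print(f"mean = {mu:.2f}, SD = {sigma:.2f}, Z for 45 = {z:.2f}")
print(f"normal-approximation tail = {upper_tail:.4f}")
```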

3.31 (a) 0.5160. (b) 0.1234. (c) 0.8483. (d) 
No, otherwise there is a 15.17% chance that 
2 or more will be afraid of spiders in any 
particular tent. 

3.33 (a) 0.109. (b) 0.219. (c) 0.137. 
(d) 0.551. (e) 0.084. (f) Since 2 is 
1.89 standard deviations (Z = -1.89) below the 
expected number of brown eyed children, 
strictly speaking this would not be consid- 
ered unusual. However, it should be noted 
that the z-score for this value is pretty close 
to 2, making this observation borderline un- 
usual. 

3.35 The probability model is below. 

Y     0       1       2       3
P(Y)  0.1458  0.3936  0.3543  0.1063

3.37 (a) (1/5)*(1/4)*(1/3)*(1/2)*(1/1) = 
1/(5!) = 1/120. (b) 120 = 5!. (c) 8! = 
40,320. 

3.39 (a) 0.0804 using the geometric distribution. 
(b) 0.0322 using the binomial distribution. 
(c) 0.0193 using the negative binomial 
distribution. 

3.41 (a) Negative Binomial (n = 4, p = 
0.55): Of the four trials considered here, 
the last trial must be a success and there 
were exactly 2 successes. (b) 0.1838. (c) 
$\binom{3}{1} = \frac{3!}{1!\,2!} = 3$. (d) In the binomial model 
we have no restrictions on the outcome of 
the last trial while in the negative binomial 
model the last trial is fixed. Therefore we 
are interested in the number of ways of ordering 
the other k - 1 successes in the 
first n - 1 trials. 

3.43 (a) Poisson with λ = 75. (b) µ = λ = 
75, σ = √λ = 8.66. (c) No, since 60 is within 
2 standard deviations of the mean. 

3.45 $P(X = 70) = \frac{75^{70} e^{-75}}{70!} = 0.0402$ 
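
The Poisson quantities in 3.43 and 3.45 can be checked numerically. The sketch below assumes the rate λ = 75 given in 3.43 and evaluates the probability mass function on the log scale, which avoids forming very large intermediate values of λ^k and k!. `poisson_pmf` is an illustrative helper written with the standard library only.

```python
from math import exp, lgamma, log, sqrt


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson(lam) count; lgamma(k + 1) equals log(k!)."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


lam = 75              # rate from solution 3.43
mu = lam              # a Poisson mean equals its rate
sigma = sqrt(lam)     # and its SD is the square root of the rate

z_60 = (60 - mu) / sigma   # how unusual is a count of 60?
p_70 = poisson_pmf(70, lam)

print(f"mean = {mu}, SD = {sigma:.2f}, Z for 60 = {z_60:.2f}")
print(f"P(X = 70) = {p_70:.4f}")
```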



4 Foundations for inference 

4.1 (a) Mean. (b) Mean. (c) Proportion. 
(d) Mean. (e) Proportion. 



4.3 The point estimates are the corresponding 
sample values. (a) x̄ = 13.65, median = 
14. (b) s = 1.91, IQR estimate = 2. (c) 
Use the Z score to evaluate (Z_16 = 1.23, 
Z_18 = 2.28), so 18 credits is unusually high 
but 16 is not, where we use 2 standard deviations 
from the mean as a cutoff for deciding 
what is unusual. 

4.5 No, sample point estimates only ap- 
proximate the population parameter, and 
they vary from one sample to another. 

4.7 Standard error, $SE_{\bar{x}} = \frac{s}{\sqrt{n}} = 0.191$. 

4.9 (a) $SE_{\bar{x}} = 2.89$. (b) The Z score is 1.73 
(absolute value is less than 2), so $80 is con- 
sistent. 

4.11 (a) Independence is met by the random 
sampling assumption and the fact that the sample 
is less than 10% of the population. The 
sample size is also sufficiently large. We cannot 
check the assumption that the distribution 
isn't extremely skewed. (b) (19.862, 
20.058). (c) We are 90% confident that 
the true mean amount of coffee in Starbucks 
venti cups is between 19.862 ounces 
and 20.058 ounces. (d) 90% of random samples 
of size 50 will yield confidence intervals 
that capture the true mean amount of coffee 
in Starbucks venti cups. (e) Yes, 20 ounces 
is included in the interval. (f) A 95% confidence 
interval would be wider. All else kept 
constant, when confidence level increases so 
does the margin of error and hence the interval 
becomes wider. We cast a wider interval. 
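
The interval in part (b) of 4.11 has the form point estimate ± z* × SE, with SE = s/√n. The sketch below shows that calculation. The sample size n = 50 and the 90% level appear in the solution, and z* = 1.65 is the value this book uses for 90% intervals. The sample mean 19.96 and standard deviation 0.42 are assumptions: the mean is the midpoint of the reported interval, and the standard deviation was back-solved so that the sketch reproduces it, since the exercise statement is not part of this appendix.

```python
from math import sqrt


def confidence_interval(mean, sd, n, z_star):
    """Normal-model interval for a mean: mean ± z* × sd / sqrt(n)."""
    standard_error = sd / sqrt(n)
    margin = z_star * standard_error
    return mean - margin, mean + margin


# n and the confidence level come from the solution; mean and sd are assumed.
lower, upper = confidence_interval(mean=19.96, sd=0.42, n=50, z_star=1.65)
print(f"90% CI: ({lower:.3f}, {upper:.3f})")

# A higher confidence level uses a larger z* and so gives a wider interval.
lower95, upper95 = confidence_interval(mean=19.96, sd=0.42, n=50, z_star=1.96)
print(f"95% CI: ({lower95:.3f}, {upper95:.3f})")
```

Comparing the two printed intervals illustrates part (f): raising the confidence level widens the interval when everything else is held fixed.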

4.13 (a) Less. (b) We can infer from the 
sample statistics that the distribution is 
skewed, so no, we cannot. (c) The only condition 
that may not be met for normality 
of the mean relates to skew: it is unclear if 
the distribution is extremely skewed or not. 
We'll suppose the skew is strong but not too 
extreme, something we may like to look into 
further. Solution: 0.0985. (d) Decreases the
standard error by a factor of √2.

4.15 When the confidence level increases, so 
does the margin of error and the width of the 
interval. A wide interval may be undesirable 
even if the confidence level is higher.

4.17 (a) False, since we need only check
whether the skew is not too extreme, (b) 
False, we are 100% sure the average for these 
patients is in this interval, (c) True, (d) 
False, the confidence interval is not about 
sample means, (e) False, as the confidence 
level increases, so does the width of the inter- 
val, (f) True, (g) False, since in calculation 
of the standard error we divide the standard 
deviation by square root of the sample size, 
we would need to quadruple the sample size. 
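The reasoning in 4.17 (g) follows from SE = s/√n. The sketch below uses a hypothetical standard deviation and sample size (both are placeholders, since only the ratios matter) to show that doubling n shrinks the standard error by a factor of √2, while quadrupling n halves it.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

sd, n = 1.0, 100  # hypothetical values; only the ratios below matter

ratio_double = standard_error(sd, 2 * n) / standard_error(sd, n)
ratio_quadruple = standard_error(sd, 4 * n) / standard_error(sd, n)

print(ratio_double, ratio_quadruple)  # about 0.707 and 0.5
```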

4.19 (a) 0.0004. (b) Since the sample is
random and the 10% condition is met, we
can assume that the weight of one penny is
independent of the weight of another. Since
the population distribution is normal, and
hence not extremely skewed, the sampling
distribution of means will be nearly normal
even though n < 50: N(μ = 2.5, σ_x̄ = 0.0095).
(c) Approximately 0. (d) Plot below. (e) The
sample or sampling distributions would not
be approximately normal.



[Figure: population distribution and sampling distribution (n = 10); horizontal axis from 2.41 to 2.59.]



4.21 (a) From the histogram: P(X > 5) =
(350 + 100 + 25 + 20 + 5) / 3000 = 500 / 3000
= 0.17. It's okay if your answer differs a
little. (b) Two different answers are
reasonable. 1) The conditions are reasonably
met. We know the population standard
deviation, so we can know the standard error
(SD of x̄) with certainty. The population
distribution is also only slightly skewed,
so a sample of 15 would probably have a
sampling distribution for the mean that is
nearly normal. Solution: 0.0956. 2) If you
had said the normality condition for x̄ was
questionable because the population
distribution was not normal, that is also an
acceptable answer. (c) Assumptions/conditions
are certainly met. Solution: 0.1788.



4.23 (a) H_0: μ = 8 (On average New Yorkers
sleep 8 hrs a night), H_A: μ < 8 (On
average New Yorkers sleep less than 8 hrs
a night). (b) H_0: μ = 15 (The average
amount of company time spent not working
is 15 minutes), H_A: μ > 15 (The average
amount of company time spent not working
is greater than 15 minutes).

4.25 The hypotheses should be about the
population mean (μ), not the sample mean.
If he believes that $1.3 million is an
overestimation, the alternative hypothesis
should be less than and not greater than. The
correct way to set up these hypotheses is
as follows: H_0: μ = $1.3 million, H_A:
μ < $1.3 million.

4.27 (a) 180 minutes is not in the inter- 
val, so this is implausible, (b) 2.2 hours 
(132 minutes) is in the interval, so we con- 
clude the estimated wait time of 2.2 hours 
is reasonable, (c) A 99% confidence interval 
will be wider than a 95% confidence interval. 
Hence even without calculating the interval 
we can tell that 132 minutes would be in it. 

4.29 (a) H_0: Anti-depressants do not work
for the treatment of Fibromyalgia. H_A:
Anti-depressants work for the treatment of
Fibromyalgia. (b) Concluding that anti- 
depressants work for the treatment of Fi- 
bromyalgia when they actually do not. (c) 
Concluding that anti-depressants do not 
work for the treatment of Fibromyalgia when 
they actually do. (d) If she makes a Type 
I error, she will continue taking medication 
that does not actually treat her disorder. If 
she makes a Type II error, she will stop tak- 
ing medication that could treat her disorder. 

4.31 (a) Yes, if we assume there isn't too 
much skew, which is certainly reasonable 
once we realize the possible percentages, 
which must be between 0% and 100%, are 
bounded within 3 standard deviations from 
the mean. (b) H_0: μ = 0.25, H_A: μ ≠ 0.25.
Z = -7.71 -> two-sided p-value ≈ 0. Reject
H_0: the evidence indicates that the
percentage of time college students spend on
the internet for coursework has changed over
the last decade. (c) If the percentage of time
college students spend on the Internet for
course work has actually remained at 25%, 
the probability of getting a random sample 
of 238 college students where the average 
percentage of time they spend on the Inter- 
net for course work is 10% or less or 40% or 
more is approximately 0. (d) Type I, since
we may have incorrectly rejected H_0.

4.33 H_0: μ = 7, H_A: μ ≠ 7. Z =
-1.04 -> single tail = 0.1492 -> p-value =
2 * 0.1492 = 0.2984. There isn't sufficient
evidence that the average lifespan of all ball
bearings produced by this machine is not 7
hours. The manufacturer's claim is not
implausible.
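The step from a Z score to a two-sided p-value in 4.33 can be reproduced without a table. The following is a sketch using only the standard library; `normal_cdf` and `two_sided_p_value` are illustrative helper names.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_p_value(z: float) -> float:
    """Double the area in the tail beyond |z|."""
    return 2.0 * normal_cdf(-abs(z))

z = -1.04                         # test statistic from answer 4.33
single_tail = normal_cdf(z)       # lower-tail area
p_value = two_sided_p_value(z)

print(round(single_tail, 4), round(p_value, 4))
```

The computed values may differ from the table-based figures in the last decimal place, because the answer doubles a tail area that was already rounded.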

4.35 x̄ = 36.05.

4.37 (a) The distribution is unimodal and 
right skewed, median is between 5 and 10 
years old, and the IQR is roughly 10. There 
are potential outliers on the higher end. (b) 
When the sample size is small, the sam- 
pling distribution is right skewed, just like 
the population distribution. As the sam- 
ple size increases, the sampling distribution 
gets more and more unimodal and symmet- 
ric, just like the CLT suggests. 

4.39 (a) If the skew is not too strong, the
assumptions are met. (b) H_0: μ = 432, H_A:
μ < 432. Z = -3.28 -> p-value (single tail)
= 0.0005. Since the p-value < α, we reject
H_0. There is evidence that the average
amount of savings of all customers who switch
their insurance is less than $432. (c) Yes,
the insurance company's claim may be an 
overestimate since the hypothesis test result 
indicated there was strong evidence that the 
average savings is less than the advertised 
amount, (d) ($376.47, $413.53). (e) Yes, the 
hypothesis test was statistically significant 
and $432 was not in the confidence interval. 

4.41 (a) The only condition we cannot 
check is for extreme skew. Here, we will 
assume this is not an issue; in practice, 
this is something we should verify. (b) H_0:
μ = 500, H_A: μ ≠ 500. Z = -3.86 -> single
tail ≈ 0 -> p-value ≈ 2 * 0 = 0. Since the
p-value < α (0.05), we reject H_0. The data
provide strong evidence that the average
increase in reading speed is not 500% (it is
below 500% based on the data). (c) No, the
company's claim of an average of 500% in- 
crease in reading speed does not appear to 
be accurate, (d) 371.88% to 458.12%. (e) 
Yes. The hypothesis test rejected that the 
average increase was 500%, and 500% was 
not in the confidence interval. 

4.43 n > 693. 



5 Large sample inference 

5.1 (a) Hypothesis test for paired data. (b)
H_0: μ_diff = 0 (There is no difference in
average daily high temperature between
January 1, 1968 and January 1, 2008), H_A:
μ_diff > 0 (Average daily high temperature on
January 1, 1968 was lower than average daily
high temperature on January 1, 2008). (c)
Independence is satisfied since we have a
random sample that is less than 10% of the
possible locations we could collect such
measurements in the continental U.S. There is
also a one-to-one correspondence between the
observations in the data set, making it
appropriate for a paired analysis. The sample
size is sufficiently large (n = 51). If we
had the data in hand, we would also check for
extreme skew. Z = 1.60, p-value = 0.0548. (d)
Fail to reject H_0. The data do not provide
strong evidence of temperature warming in
the continental US. However, it should be
noted that the p-value is very close to 0.05.
(e) Type II. We would have made such an
error if we concluded that there isn't
strong evidence for temperature warming in
the continental US when in reality the
average temperature on January 1, 2008 is
higher than the average temperature on
January 1, 1968. (f) Yes.

5.3 (a) H_0: μ_B = μ_A, another way to write
this is μ_B - μ_A = 0 (The population mean
of number of cigarettes smoked per day did
not change after the Surgeon General's
report), H_A: μ_B > μ_A, another way to write
this is μ_B - μ_A > 0 (The population mean
of number of cigarettes smoked per day
decreased after the Surgeon General's report).
(b) Independence is satisfied since we have
two random samples that are less than 10%
of their respective populations. The sample
sizes are sufficiently large. If we had the
data in hand, we would also check for
extreme skew. Z = 1.89, p-value = 0.0294.
(c) Reject H_0. There is sufficient evidence
that the number of cigarettes smoked per
day decreased after the Office of the
Surgeon General's report. (d) No, we cannot
make a causal connection because this is
observational data. (e) Type I, since we may
have incorrectly rejected H_0.

5.5 Independence is satisfied since we have 
two independent random samples that are 
each less than 10% of the population. The 
sample sizes are sufficiently large. If we had 
the data in hand, we would also check for 
extreme skew. We are 90% confident that 
the average score in 2004 was 0.16 to 5.84 
points lower than the average score in 2008. 

5.7 H_0: μ_M = μ_W, H_A: μ_M < μ_W. Z =
-97.35, p-value ≈ 0. Reject H_0. The data
provide strong evidence that average body
fat percentage for women is higher.

5.9 (a) False, for proportions need to check 
success/failure condition, not n > 50. (b) 
True, (c) False, only 1.65 standard errors 
away from the mean, and we use 2 as a
cutoff for what is called unusual. (d) True.
(e) False, standard error would decrease only
by a factor of √2.

5.11 (a) True. (b) False, standard error
would decrease only by a factor of √2. (c)
True, (d) True, (e) False, success/failure 
condition is not satisfied. 

5.13 (a) Parameter: proportion of all
graduates from this university who found a job
within one year of graduating. Point esti- 
mate: 0.87. (b) Independence is satisfied 
since we have a random sample that is less 
than 10% of the population. Normality is 
satisfied since the success-failure condition 
is met. CI: (0.837 , 0.903). (c) We are 
95% confident that the true proportion of 
graduates from this university who found a 
job within one year of completing their un- 
dergraduate degree is between 83.7% and 
90.3%. (d) 95% of random samples of 400
would produce a confidence interval that
includes the true proportion of students at this
university who found a job within one year 
of graduating from college, (e) It would be 
wider, (f) It would be narrower. 
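The interval in 5.13 (b) can be reconstructed from the point estimate 0.87 and the sample size of 400 mentioned in part (d). This is a minimal sketch; the function name `proportion_ci` is illustrative, and z* = 1.96 is the usual 95% critical value.

```python
import math

def proportion_ci(p_hat: float, n: int, z_star: float) -> tuple:
    """Normal-approximation confidence interval for a single proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error uses the sample proportion
    margin = z_star * se
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(p_hat=0.87, n=400, z_star=1.96)
print(round(low, 3), round(high, 3))
```

Raising the confidence level (a larger z*) widens the interval, and raising n narrows it, which matches parts (e) and (f).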

5.15 (a) She needs a minimum of 3,394 sub- 
jects and therefore needs to set aside a min- 
imum of $67,880. (b) It will be wider. 

5.17 (a) ME = 1.96 * √(0.66 * 0.34 / 1,018) ≈
0.03. (b) No, for two reasons. The point
estimate is slightly below 67%, and 67% is
contained in the interval. 

5.19 (a) H_0: p = 0.5, H_A: p > 0.5. The
sample is a simple random sample from less
than 10% of the population, so independence
is reasonable. The success/failure condition
is also met. Z = 4.66, p-value ≈ 0. Since
the p-value is small, we reject H_0. The data
provide strong evidence that a majority of
Americans think the Civil War is still
relevant. (b) If in fact only 50% of Americans
thought the Civil War is still relevant, the
probability of obtaining a random sample of
1,507 Americans where 56% or more think it is
still relevant would be approximately 0. (c)
We are
90% confident that 54% to 58% of all Ameri- 
cans think that the Civil War is still relevant. 
This agrees with the conclusion of the earlier 
hypothesis test since the interval lies above 
50%. 
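The test statistic in 5.19 (a) can be recomputed from the sample proportion 0.56, the sample size 1,507, and the null value 0.5. The sketch below uses illustrative helper names; note that the standard error for the test uses the null value p_0, not the sample proportion.

```python
import math

def one_proportion_z(p_hat: float, p0: float, n: int) -> float:
    """Z statistic for H_0: p = p0, with the SE computed under the null."""
    se_null = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se_null

def upper_tail_p_value(z: float) -> float:
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

z = one_proportion_z(p_hat=0.56, p0=0.5, n=1507)
p_value = upper_tail_p_value(z)
print(round(z, 2), p_value)
```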

5.21 (a) H_0: p = 0.5, H_A: p < 0.5. The
assumptions and conditions are satisfied. Z =
-0.73, p-value = 0.2327. Since the p-value is
large, we fail to reject H_0. The data do not
provide strong evidence that less than half 
of American adults who decide to not go to 
college make this decision because they can- 
not afford college, (b) Yes, since we failed to 
reject H 0 . 

5.23 (a) The assumptions and conditions 
are satisfied. We are 80% confident that
44.5% to 51.5% of all Americans who decide
not to go to college do so because they can- 
not afford it. This agrees with the conclu- 
sion of the earlier hypothesis test since the 
interval includes 50%. (b) 1,818.

5.25 (a) H_0: p = 0.3, H_A: p > 0.3.
The assumptions and conditions are
satisfied. Z = 1.89, p-value = 0.0294. Since
the p-value is small, we reject H_0. The data
provide strong evidence that the rate of
sleep deprivation for New Yorkers is higher
than the rate of sleep deprivation in the
population at large. (b) If in fact 30% of New
Yorkers were sleep deprived, the probability
of getting a random sample of 300 New
Yorkers where 105 or more are sleep deprived
would be 0.0294.

5.27 (a) H_0: p = 0.18, H_A: p ≠ 0.18.
The assumptions and conditions are
satisfied. Z = 0.74, p-value = 0.4592. Since
the p-value is large, we fail to reject H_0.
The data do not provide strong evidence that
the percentage of students at this university
who smoke has changed over the last five
years. (b) Type II, since we may have
incorrectly failed to reject H_0.

5.29 (a) H_0: p = 0.65, H_A: p > 0.65.
Assuming that 250 is less than 10% of the high
school graduates in this school district, all
conditions and assumptions are satisfied. Z =
1.26, p-value = 0.1038. Since the p-value is
large, we fail to reject H_0. The data do not
provide strong evidence that the percentage 
of students in this rural school district who 
go out of state for college has increased, (b) 
If in fact 65% of students in this school dis- 
trict went out of state for college, the prob- 
ability of getting a random sample of 250 
students where 172 or more of them go out 
of state for college would be 0.1038. 

5.31 (a) The assumptions and conditions 
are satisfied. 95% CI: (0.138, 0.270). (b) 
We are 95% confident that the proportion of 
students from the rural school district who 
plan to go out of state for college is 13.8% to 
27% higher than the proportion of students 
from the urban school district who do. 

5.33 (a) H_0: p_D = p_I, H_A: p_D > p_I.
The assumptions and conditions are
satisfied. Z = 11.29, p-value ≈ 0. Since the
p-value is very small, we reject H_0. The data
provide strong evidence that the proportion
of Democrats who support the plan is higher
than the proportion of Independents who
support the plan. (b) Type I, since we may
have incorrectly rejected H_0. (c) No,
rejecting the null hypothesis of p_1 = p_2 is
equivalent to rejecting that p_1 - p_2 = 0. Therefore
we would not expect a confidence interval 
for the difference between the two propor- 
tions to include 0. (d) We are 95% confi- 
dent that the proportion of Democrats who 
support the plan is 23% to 33% higher than 
the proportion of Independents who do. (e) 
True. 

5.35 The assumptions and conditions are 
satisfied. We are 95% confident that the 
proportion of Californians who are sleep de- 
prived is 1.7% less to 0.1% more than the 
proportion of Oregonians who are sleep
deprived. Since the confidence interval
includes 0, we would not reject a null
hypothesis that the two population proportions
are equal to each other.

5.37 (a) True, (b) False, the interval only 
estimates the difference in population
parameters. (c) False, to get the 95%
confidence interval for (p_placebo - p_medication),
all we have to do is to swap the bounds of the
original confidence interval and take their
negatives, (d) True, (e) False, the confi- 
dence interval for the difference between the 
proportions of success includes 0, so we can- 
not reject the hypothesis of no difference. 

5.39 (a) College grads: 35.2%. Non-grads: 
33.9%. (b) H_0: p_CG = p_NCG, H_A: p_CG ≠
p_NCG. The assumptions and conditions are
satisfied. Z = 0.37, p-value = 0.7114. Since
the p-value is large, we fail to reject H_0.
The data do not provide strong evidence 
of a difference between the proportions of 
college graduates and non-college graduates 
who support off-shore drilling in California. 

5.41 (a) We are 90% confident that the pro- 
portion of Republicans who support the use 
of full-body scans at airports is 3% lower to 
7% higher than the proportion of Democrats 
who do. (b) No, this does not prove it; 
though the data do not provide strong
evidence to the contrary.

5.43 (a) H_0: The distribution of the format
of the book used by the students follows
the professor's predictions. H_A: The
distribution of the format of the book used
by the students does not follow the
professor's predictions. (b) E_hard copy = 75.6,
E_print = 31.5, E_online = 18.9. (c) We are
not told explicitly that the sample is ran- 
dom, however, we have no reason to be- 
lieve that this class is not representative of 
all introductory statistics students. We can 
safely assume that 126 < 10% of all intro- 
ductory statistics students. We may think 
it is reasonable to suppose the students are 
independent. However, the professor prob- 
ably should have included a question ask- 
ing whether the student decisions relied on 
any other students' decisions when they pur- 
chased, printed, or read the book online. All 
expected counts are at least 10. Format of 
the book used is a categorical variable. (d)
X² = 2.32, df = 2, p-value > 0.3. (e) Since
the p-value is large, we fail to reject H_0.
The data do not provide strong evidence
indicating the professor's predictions were
statistically inaccurate.
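The expected counts and the p-value bound in 5.43 can be sketched as follows. The predicted proportions (60%, 25%, 15%) are an assumption inferred by dividing the stated expected counts by the class size of 126; they are not quoted in the answer itself. For df = 2 the chi-square upper tail has the closed form exp(-x/2), which the helper `chi2_upper_tail_df2` (an illustrative name) relies on.

```python
import math

n = 126  # class size stated in part (c)

# Assumed predictions, inferred from the expected counts 75.6, 31.5, 18.9.
predicted = {"hard copy": 0.60, "print": 0.25, "online": 0.15}
expected = {fmt: round(n * share, 1) for fmt, share in predicted.items()}

def chi2_upper_tail_df2(x: float) -> float:
    """P(X^2 > x) for a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)

p_value = chi2_upper_tail_df2(2.32)  # test statistic from part (d)
print(expected, round(p_value, 4))
```

The closed form only holds for df = 2; other degrees of freedom need a table or a statistics library.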

5.45 (a) 47.5. (b) 296.6. (c) 21.0. 

5.47 (a) H_0: There is no difference in
the rates of autism of children of mothers
who did and did not use prenatal vitamins
during the first three months before
pregnancy. H_A: There is some difference in
the rates of autism of children of mothers 
who did and did not use prenatal vitamins 
during the first three months before
pregnancy. (b) E_row1,col1 = 95.2, E_row1,col2 =
85.8, E_row2,col1 = 158.8, E_row2,col2 =
143.2. The assumptions and conditions are
satisfied. X² = 8.85, df = 1, 0.001 < p-value
< 0.005. Since the p-value is small, we reject
H_0. There is strong evidence of a difference in
the rates of autism of children of mothers 
who did and did not use prenatal vitamins 
during the first three months before preg- 
nancy, (c) The title of this newspaper article 
makes it sound like using prenatal vitamins 
can prevent autism, which is a causal state- 
ment. Since this is an observational study, 
we cannot make causal statements based on
the findings of the study. A more accurate
title would be "Mothers who use prenatal
vitamins before pregnancy are found to have
children with a lower rate of autism".
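The expected counts in 5.47 (b) follow the usual rule for two-way tables: (row total * column total) / table total. In the sketch below the margins are an assumption: they were recovered by summing the stated expected counts across rows and columns, not taken from the exercise data.

```python
def expected_counts(row_totals, col_totals):
    """Expected cell counts under independence for a two-way table."""
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Margins inferred from the expected counts in the answer (assumed, see above).
row_totals = [181, 302]
col_totals = [254, 229]

table = expected_counts(row_totals, col_totals)
rounded = [[round(cell, 1) for cell in row] for row in table]
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(rounded, df)
```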

5.49 H 0 : The opinion of college grads and 
non-grads is not different on the topic of 
drilling for oil and natural gas off the coast 
of California. H_A: Opinions regarding the
drilling for oil and natural gas off the coast of
California have an association with college
education. E_row1,col1 = 151.5, E_row1,col2 =
134.5, E_row2,col1 = 162.1, E_row2,col2 =
143.9, E_row3,col1 = 124.5, E_row3,col2 =
110.5. The assumptions and conditions are
satisfied. X² = 11.46, df = 2, 0.001 <
p-value < 0.005. Since the p-value is small,
we reject H_0. There is strong evidence that
there is some difference in rate of support for 
drilling for oil and natural gas off the Coast 
of California based on whether or not the 
respondent graduated from college. Support 
for off-shore drilling and having graduated 
from college do not appear to be indepen- 
dent. 



6 Small sample inference 

6.1 (a) t*_41 = 1.68 (b) t*_20 = 2.53 (c) t*_28 =
2.05 (d) t*_11 = 3.11

6.3 With a larger critical value, the confi- 
dence interval ends up being wider. 

6.5 (a) H_0: μ = 8 (New Yorkers sleep 8
hrs per night on average.), H_A: μ < 8
(New Yorkers sleep less than 8 hrs per night
on average.) (b) Independence is satisfied
since the sample is random and less than
10% of the population. The distribution
doesn't appear to be strongly skewed. T =
-1.75, df = 24. (c) If in fact the true
population mean of the amount New Yorkers
sleep per night was 8 hours, the probability
of getting a random sample of 25 New
Yorkers where the average amount of sleep
is 7.73 hrs per night or less is between 0.025
and 0.05. (d) Reject H_0, the data provide
strong evidence that New Yorkers sleep less
than 8 hours per night on average. (e) No.

6.7 (a) We are 90% confident that New
Yorkers on average sleep 7.47 to 7.99 hours 
per night, (b) Yes. 

6.9 (a) H_0: μ = 1900, H_A: μ ≠ 1900.
Independence assumption is met since the
sample is random and less than 10% of the
population. We are told to assume normality.
T = -1.66, df = 29, 0.10 < p-value < 0.20.
Since the p-value > 0.05, we fail to reject H_0.
The data do not provide strong evidence of 
a change in the average calorie intake of din- 
ers at this restaurant, (b) We are 95% confi- 
dent that diners at this restaurant consume 
an average of 1690 calories to 1922 calories 
per meal, (c) Yes. 

6.11 x̄ = 56.91.

6.13 No, distributions are extremely 
skewed. 

6.15 (a) p-value < 0.005, we reject H_0. (b)
p-value is about 0.01, we reject H_0. (c)
0.025 < p-value < 0.05, we reject H_0. (d)
p-value > 0.20, we fail to reject H_0.

6.17 (a) We are 95% confident that those 
in the group that got the weight loss pill lost 
0.92 lbs less to 4.92 lbs more than those in 
the placebo group, (c) No. (d) No. 

6.19 (a) Chickens that were fed linseed
weigh on average 218.75 grams while those
that were given horsebean weigh on average
160.20 grams. Both distributions are
relatively symmetric with no apparent
outliers. There is more variability in the
weights of chickens that were given linseed.
(b) H_0: μ_L = μ_H, H_A: μ_L ≠ μ_H.
Independence is satisfied since both samples
are random and less than 10% of their
respective populations. The distributions do
not appear to be extremely skewed and the
samples are independent of each other. T =
3.02, df = 10, 0.01 < p-value < 0.02. Reject
H_0, the data provide strong evidence of a
difference between the average weights of
chickens that were fed linseed and horsebean.
(c) Type I, since we may have incorrectly
rejected H_0. (d) Yes.

6.21 (a) H_0: μ_A = μ_M, H_A: μ_A ≠ μ_M.
T = 5.46, df = 25, p-value < 0.01. Reject
H_0, the data provide strong evidence
that there is a difference in the average city
mileage between cars with automatic and 
manual transmissions. 

6.23 We are 95% confident that on the 
highway cars with manual transmission get 
on average 5.53 to 10.33 MPG more than 
cars with automatic transmission. 

6.25 H_0: μ_T = μ_C, H_A: μ_T ≠ μ_C. T =
2.69, df = 21, 0.01 < p-value < 0.02. Since
the p-value < 0.05, we reject H_0. The data
provide strong evidence that the amount of 
biscuits consumed by the patients in the 
treatment and control groups are different. 

6.27 (a) H_0: p = 0.69, H_A: p ≠ 0.69. (b)
p̂ = 0.57. (c) The success-failure condition is
not satisfied. (d) Each student can be
represented with a card. Take 100 cards, 69
black cards representing those who follow the
news about Egypt and 31 red cards representing
those who do not. Shuffle the cards and draw
with replacement (shuffling each time in
between draws) 30 cards representing the 30
high school students. Calculate the
proportion of black cards in this sample,
p̂_sim, i.e. the proportion of those who follow
the news. Repeat 10,000 times and plot the
resulting sample proportions. The p-value will
be two times the proportion of simulations
where p̂_sim ≤ 0.57. (Note: answers may vary,
and in practice we would use a computer to
simulate.) (e) p-value ≈ 0.27 (Note: answers
may vary a little.) Fail to reject H_0. The
data do not provide strong evidence that the
proportion of high school students who
followed the news about Egypt is different than
the proportion of American adults who did.
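The card-drawing procedure in part (d) translates directly into a short simulation. This is a sketch under stated assumptions: the seed is an arbitrary choice made for reproducibility, and the function name `simulated_p_value` is illustrative. The simulated p-value depends on the seed and on rounding of the observed proportion, so it need not match the figure quoted in part (e) exactly.

```python
import random

# 100 cards: 69 black (follow the news) and 31 red (do not), as in part (d).
CARDS = ["black"] * 69 + ["red"] * 31

def simulated_p_value(n=30, observed=0.57, reps=10_000, seed=1):
    """Two-sided p-value: twice the share of simulated proportions <= observed."""
    rng = random.Random(seed)            # arbitrary seed, for reproducibility only
    at_or_below = 0
    for _ in range(reps):
        draws = rng.choices(CARDS, k=n)  # draw n cards with replacement
        p_sim = draws.count("black") / n
        if p_sim <= observed:
            at_or_below += 1
    return 2 * at_or_below / reps

p_value = simulated_p_value()
print(p_value)
```

Whatever the seed, the result should be well above 0.05, which is consistent with failing to reject H_0.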

6.29 (a) H_0: p_P = p_C, H_A: p_P ≠ p_C. (b)
-0.35. (c) Doubling the one tail, the p-value
is about 0.03. Reject H_0. The data provide
strong evidence that people react differently 
under the two scenarios. 



7 Introduction to linear regression 

7.1 (a) The relationship is linear therefore 
the residuals plot will show randomly dis- 
tributed residuals around 0 with constant 
variance. (b) The scatterplot shows a fan
shape, with higher variability in y for lower
x. Therefore the residuals plot will also show 
a fan shape, wider around lower x, narrower 
around higher x. There may also be char- 
acteristics indicating nonlinearity for points 
on the left. 

7.3 (2) and (5) show a strong correlation. 
Even though (1) and (4) show a strong asso- 
ciation, the relationship is not linear there- 
fore correlation would not be strong. (3) 
and (6) show very weak or no relationship. 
Answers may vary slightly, e.g. one person's
moderate may be equivalent to another
person's strong.

7.5 (a) Exam 2, since the points cluster 
closer to the line in the second scatterplot. 
(b) Exam 2 and the final are relatively close 
to each other chronologically, or Exam 2 may 
be cumulative so has greater similarities in 
material to the final exam. 

7.7 (a) 4. (b) 3. (c) 1. (d) 2. 

7.9 (a) The relationship is positive, weak, 
and possibly linear. There appears to be 
one outlier, a student who is about 63 inches 
tall whose fastest speed is 0 mph. This is 
probably a student who doesn't drive, (b) 
There is no obvious explanation why sim- 
ply being tall should lead a person to drive 
faster. However, one possible outside factor 
may be gender. Males tend to be taller than 
females on average, and personal
experiences (anecdotal) may suggest they drive
faster (confirmed in sociological studies). (c)
It appears that males are taller on aver- 
age than females and they also drive faster. 
The gender variable is a lurking variable for 
the positive association we observe between 
fastest driving speed and height. 

7.11 (a) There is a somewhat weak, pos- 
itive, possibly linear relationship between 
the distance traveled and travel time, (b) 
Changing the units will not change the form, 
direction or strength of the relationship be- 
tween the two variables, (c) Since changing 
units doesn't affect correlation, R = 0.636. 

7.13 (a) There is a moderately strong, pos- 
itive, linear relationship between shoulder 
girth and height, (b) Changing the units, 



even if just for one of the variables, will not 
change the form, direction or strength of the 
relationship between the two variables. 

7.15 (a) R = 1. (b) R = 1. (c) R = 1.

7.17 (a) There is a positive, very strong, lin- 
ear association between number of tourists 
and spending, (b) Explanatory: number of 
tourists (in thousands), response: spending 
(in million $). (c) We can predict spend- 
ing for a given number of tourists using a 
regression line. This may be useful informa- 
tion for determining how much the country 
may want to spend in advertising abroad, or 
to forecast expected revenues from tourism. 

7.19 Even though the relationship appears 
linear in the scatterplot, the residuals plot 
actually shows a non-linear relationship, 
therefore we should not fit a least squares 
line to these data. 

7.21 (a) travel time = 51 + 0.726 * distance.
(b) b_1: For each additional mile in distance,
the model predicts an additional 0.726
minutes in travel time. b_0: When the distance
traveled is 0 miles, the travel time is ex- 
pected to be 51 minutes. It does not make 
sense to have a travel distance of 0 miles. 
Here, the y-intercept serves only to adjust 
the height of the line and is meaningless by 
itself, (c) 126 minutes, (d) 42 minutes, un- 
derestimate, (e) No, extrapolation. 
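Parts (c) and (d) of 7.21 amount to plugging a distance into the fitted line and subtracting the prediction from an observed time. In the sketch below the distance of 103 miles and the observed time of 168 minutes are hypothetical inputs, chosen only because they are consistent with the stated answers of 126 and 42 minutes; they are not quoted in this answer key.

```python
def predicted_travel_time(distance_miles: float) -> float:
    """Fitted line from part (a): travel time = 51 + 0.726 * distance."""
    return 51 + 0.726 * distance_miles

distance = 103         # hypothetical distance in miles (assumption)
observed_time = 168    # hypothetical observed travel time in minutes (assumption)

prediction = predicted_travel_time(distance)
residual = observed_time - prediction  # positive residual: the model underestimates

print(round(prediction), round(residual))
```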

7.23 Approximately 40% of the variability 
in travel time is accounted for by the model, 
i.e. explained by distance traveled. 

7.25 No, there is an outlier that appears 
to have substantial pull on the line. We'll 
see more on this topic in the next section. 
The residuals do not show a random
scatter around 0, which further suggests that a
linear model may not be appropriate. 

7.27 (a) Influential, (b) Leverage, (c) Nei- 
ther influential nor leverage. 

7.29 Neither influential nor high leverage. 

7.31 (a) The relationship appears to be 
strong, positive and linear. There is one po- 
tential outlier, the student who had 9 cans of 
beer. (b) BAC = -0.0127 + 0.0180 * beers.
b_1: For each additional can of beer
consumed, the model predicts an additional
0.0180 grams per deciliter BAC. b_0:
Students who don't have any beer are expected
to have a blood alcohol content of -0.0127. 
It is not possible to have a negative blood 
alcohol content. Here, the y-intercept serves 
only to adjust the height of the line and is 
meaningless by itself. (c) H_0: β_1 = 0, H_A:
β_1 > 0. p-value ≈ 0. Reject H_0. Number
of cans of beer consumed and blood alco- 
hol content are positively correlated and the 
true slope parameter is indeed greater than 
0. (d) Approximately 79% of the variability 
in blood alcohol content can be explained by 
number of cans of beer consumed. 

7.33 (a) H_0: β_1 = 0, H_A: β_1 ≠ 0. T =
35.25, df = 168, p-value ≈ 0. Reject
H_0. Wives' and husbands' ages are
correlated and the true slope parameter is
indeed greater than 0. (b) ageWife = 1.5740 +
0.9112 * ageHusband. (c) b_1: For each
additional year in husband's age, the model
predicts an additional 0.9112 years in wife's age.
b_0: Men who are 0 years old are expected to
have wives who are on average 1.5740 years 
old. The intercept here is meaningless and 
serves only to adjust the height of the line. 

7.35 (a) R = 0.94. The slope is positive, so 
R must also be positive. (b) 51.69, since R²
is high, the prediction based on this
regression model is reliable. (c) No, extrapolation.

7.37 (a) R = -0.53. The slope is negative,
so R must also be negative. (b) H_0: β_1 = 0,
H_A: β_1 ≠ 0. T = -4.32, df = n - 2 = 49,
p-value ≈ 0.0001. Reject H_0. Percent
homeownership and percent of the population
living in an urban setting are negatively
correlated and the true slope parameter is
indeed less than 0. (c) The calculations and plotted
line are not shown. The regression line does 
not adequately fit these data, (d) There 
is a fan shaped pattern apparent in this 
plot, which indicates non-constant variabil- 
ity in the residuals (little variability when x 
is small, more variability when x is large). 
Since the residuals have changing variabil- 
ity as we move across the plot, we should 
seek more appropriate statistical methods if 
we want to obtain a reliable estimate of the
best fitting straight line.



8 Multiple regression and ANOVA 

8.1 (a) weight = 248.64 + 74.94 * casein. 

(b) The estimated mean weight of chicks 
who are on casein feed is 74.94 grams higher 
than those who are given other feeds. Ca- 
sein: 323.58 grams, No casein: 248.64 grams. 

(c) H_0: The true coefficient for casein is
zero (β_1 = 0). H_A: The true coefficient for
casein is not zero (β_1 ≠ 0). T = 3.23, and
the p-value is approximately 0.0019. With 
such a low p-value, we reject H 0 . The data 
provide strong evidence that the true slope 
parameter is different than 0, and hence 
there appears to be a statistically signifi- 
cant relationship between feed type (casein 
or other) and the average weight of chicks. 
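
The two group means quoted in part (b) follow from plugging the indicator values into the fitted equation. A minimal sketch, assuming `casein` is coded 1 for casein feed and 0 otherwise (the function name `predicted_weight` is illustrative):

```python
def predicted_weight(casein: int) -> float:
    """Fitted mean chick weight in grams; casein is 1 for casein feed, 0 otherwise."""
    return 248.64 + 74.94 * casein

print(round(predicted_weight(1), 2))  # chicks on casein feed
print(round(predicted_weight(0), 2))  # chicks on other feeds
```

With a single indicator predictor, the intercept is the mean of the reference group and the slope is the difference between the two group means.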

8.3 (a) bwt = 123.05 - 8.94 * smoke. (b)
The estimated body weight of babies born to
smoking mothers is 8.94 ounces lower than
those who are born to non-smoking mothers.
Smoker: 114.11 ounces, Non-smoker: 123.05
ounces. (c) H_0: The true coefficient for
smoke is zero (β_1 = 0). H_A: The true
coefficient for smoke is not zero (β_1 ≠ 0).
T = -8.65, and the p-value is approximately
0. Since the p-value is very small, we reject
H_0. The data provide strong evidence that
the true slope parameter is different from 0.
There is strong evidence that the linear
relationship between birth weight and smoking
is real.

8.5 (a) bwt = -80.41 + 0.44 * gestation -
3.33 * parity - 0.01 * age + 1.15 * height +
0.05 * weight - 8.40 * smoke. (b)
β_gestation: The model predicts a 0.44 ounce
increase in the birth weight of the baby for
each additional day in length of pregnancy,
all else held constant. β_age: The model
predicts a 0.01 ounce decrease in the birth
weight of the baby for each additional year in
mother's age, all else held constant. (c)
Parity might be correlated with one of the
other variables in the model, which introduces
collinearity and complicates model estimation.
(d) -0.58.
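
The "all else held constant" interpretation in part (b) can be checked numerically. The sketch below uses the coefficients from part (a); the example case `baby` is entirely hypothetical and the helper name `predict_bwt` is an assumption made for illustration.

```python
# Coefficients copied from the fitted model in part (a).
INTERCEPT = -80.41
COEFFICIENTS = {
    "gestation": 0.44,
    "parity": -3.33,
    "age": -0.01,
    "height": 1.15,
    "weight": 0.05,
    "smoke": -8.40,
}

def predict_bwt(case: dict) -> float:
    """Predicted birth weight in ounces for one case."""
    return INTERCEPT + sum(COEFFICIENTS[name] * case[name] for name in COEFFICIENTS)

# Hypothetical values, not taken from the data set.
baby = {"gestation": 280, "parity": 0, "age": 27, "height": 64, "weight": 120, "smoke": 0}
one_more_day = dict(baby, gestation=baby["gestation"] + 1)

# Changing only gestation by one day shifts the prediction by its coefficient.
print(round(predict_bwt(one_more_day) - predict_bwt(baby), 2))
```

A residual, as in part (d), would be the observed birth weight minus the value returned by `predict_bwt` for that case.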
8.7 (a) R^2 = 1 - (249.28/332.57) = 0.2504.
R^2_adj = 1 - (249.28/(1236 - 6 - 1)) /
(332.57/(1236 - 1)) = 0.2468.
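
The arithmetic above can be reproduced in a few lines. In this sketch the variable names are assumptions: the two numbers are read as the variance of the residuals and the variance of the outcome, with n = 1236 observations and k = 6 predictors, which is what the form of the adjusted formula suggests.

```python
var_residuals = 249.28  # assumed: variance of the residuals
var_outcome = 332.57    # assumed: variance of the outcome
n = 1236                # number of observations
k = 6                   # number of predictors in the model

r_squared = 1 - var_residuals / var_outcome
# The adjustment divides each variance by its degrees of freedom.
adj_r_squared = 1 - (var_residuals / (n - k - 1)) / (var_outcome / (n - 1))

print(round(r_squared, 4))      # 0.2504
print(round(adj_r_squared, 4))  # 0.2468
```

The adjusted value is slightly smaller because the degrees-of-freedom correction penalizes each additional predictor.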

8.9 (a) There does not appear to be a
significant relationship between the age of
the mother and the birth weight of the baby,
since the p-value for the age variable is
relatively high. We might consider removing
this variable from the model. (b) No, all
variables in the model now appear to have a
significant relationship with the outcome, so
we would not need to remove any more
variables.

8.11 Based on the p-value alone, either
gestation or smoke should be added to the
model first. However, since the adjusted R^2
for the model with gestation is higher, it
would be preferable to add gestation in the
first step of the forward-selection algorithm.
(Other explanations are possible. For
instance, it would be reasonable to only use
the adjusted R^2.)

8.13 (1) Normality of residuals: The normal
probability plot shows a nearly straight line
of points, providing evidence that the nearly
normal assumption is reasonable. (2) Constant
variance of residuals: The scatterplot of the
absolute values of residuals versus the fitted
values suggests that there may be a few
outliers, some with lower than average fitted
values and some with higher than average
fitted values. (3) The residuals should be
independent: The scatterplot of residuals
versus the order of data collection shows a
random scatter, suggesting that this
assumption is met. (4) Each variable should be
linearly related to the outcome (i.e. we don't
see any nonlinear trends): No nonlinear trends
are evident. However, there are some outliers
at the extremes of length of gestation and
weight of the mother, so we should carefully
examine these particular cases. There is some
concern regarding constant variance across the
parity groups.

We have two main concerns: outliers 
and constant variance. None of the outliers 
are exceptionally extreme, and there are a 
very large number of observations, so the in- 
fluence of the outliers is probably mitigated 
(though we may want to study them more 
carefully, if possible). Additionally, while 
the constant variance assumption is violated 
across the parity groups, this violation is not 
very extreme. It is probably still reasonable 
to report the results while noting this model 
violation. 

8.15 Based on the side-by-side boxplots
shown in Exercise 6.19, the constant variance
assumption appears to be reasonable. Because
the chicks were randomly assigned to their
groups (and presumably kept separate from one
another), independence of observations is also
reasonable. H_0: μ_1 = μ_2 = ... = μ_6.
H_A: The average weight varies across some (or
all) groups. F_{5,65} = 15.36 and the p-value
is approximately 0. With such a small p-value,
we reject H_0. The data provide strong
evidence that the average weight of chicks
varies across some (or all) groups.
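
For readers who want to see how an ANOVA F statistic is assembled, the sketch below computes it from raw group data as the ratio of the mean square between groups to the mean square error. The data are hypothetical toy values, not the chick weights, and the function name `anova_f` is an assumption.

```python
from statistics import mean

def anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Sum of squares between groups, weighted by group size.
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Sum of squared errors within groups.
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_g, df_e = k - 1, n - k
    return (ssg / df_g) / (sse / df_e), df_g, df_e

# Hypothetical toy data: three groups of three observations.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_g, df_e = anova_f(groups)
print(f_stat, df_g, df_e)
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom gives a small p-value, which is the basis for rejecting H_0 above.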



Appendix C 

Distribution tables 



C.1 Normal Probability Table

The area to the left of Z represents the percentile of the observation. The normal probability 
table always lists percentiles. 



[Figure: two normal curves, labeled negative Z and positive Z.]

To find the area to the right, calculate 1 minus the area to the left. 

1.0000 - 0.6664 = 0.3336 



For additional details about working with the normal distribution and the normal
probability table, see Section 3.1.
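
The same areas can be computed in software instead of read from the table. This minimal sketch uses `statistics.NormalDist` from the Python standard library; Z = 0.43 is an assumption, chosen because its table entry is the 0.6664 used in the subtraction above.

```python
from statistics import NormalDist

z = 0.43                     # assumed Z score whose table entry is 0.6664
left = NormalDist().cdf(z)   # area to the left of Z (the percentile)
right = 1 - left             # area to the right of Z

print(round(left, 4))   # 0.6664
print(round(right, 4))  # 0.3336
```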
negative Z

                      Second decimal place of Z
  0.09    0.08    0.07    0.06    0.05    0.04    0.03    0.02    0.01    0.00     Z
0.0002  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  -3.4
0.0003  0.0004  0.0004  0.0004  0.0004  0.0004  0.0004  0.0005  0.0005  0.0005  -3.3
0.0005  0.0005  0.0005  0.0006  0.0006  0.0006  0.0006  0.0006  0.0007  0.0007  -3.2
0.0007  0.0007  0.0008  0.0008  0.0008  0.0008  0.0009  0.0009  0.0009  0.0010  -3.1
0.0010  0.0010  0.0011  0.0011  0.0011  0.0012  0.0012  0.0013  0.0013  0.0013  -3.0

0.0014  0.0014  0.0015  0.0015  0.0016  0.0016  0.0017  0.0018  0.0018  0.0019  -2.9
0.0019  0.0020  0.0021  0.0021  0.0022  0.0023  0.0023  0.0024  0.0025  0.0026  -2.8
0.0026  0.0027  0.0028  0.0029  0.0030  0.0031  0.0032  0.0033  0.0034  0.0035  -2.7
0.0036  0.0037  0.0038  0.0039  0.0040  0.0041  0.0043  0.0044  0.0045  0.0047  -2.6
0.0048  0.0049  0.0051  0.0052  0.0054  0.0055  0.0057  0.0059  0.0060  0.0062  -2.5

0.0064  0.0066  0.0068  0.0069  0.0071  0.0073  0.0075  0.0078  0.0080  0.0082  -2.4
0.0084  0.0087  0.0089  0.0091  0.0094  0.0096  0.0099  0.0102  0.0104  0.0107  -2.3
0.0110  0.0113  0.0116  0.0119  0.0122  0.0125  0.0129  0.0132  0.0136  0.0139  -2.2
0.0143  0.0146  0.0150  0.0154  0.0158  0.0162  0.0166  0.0170  0.0174  0.0179  -2.1
0.0183  0.0188  0.0192  0.0197  0.0202  0.0207  0.0212  0.0217  0.0222  0.0228  -2.0

0.0233  0.0239  0.0244  0.0250  0.0256  0.0262  0.0268  0.0274  0.0281  0.0287  -1.9
0.0294  0.0301  0.0307  0.0314  0.0322  0.0329  0.0336  0.0344  0.0351  0.0359  -1.8
0.0367  0.0375  0.0384  0.0392  0.0401  0.0409  0.0418  0.0427  0.0436  0.0446  -1.7
0.0455  0.0465  0.0475  0.0485  0.0495  0.0505  0.0516  0.0526  0.0537  0.0548  -1.6
0.0559  0.0571  0.0582  0.0594  0.0606  0.0618  0.0630  0.0643  0.0655  0.0668  -1.5

0.0681  0.0694  0.0708  0.0721  0.0735  0.0749  0.0764  0.0778  0.0793  0.0808  -1.4
0.0823  0.0838  0.0853  0.0869  0.0885  0.0901  0.0918  0.0934  0.0951  0.0968  -1.3
0.0985  0.1003  0.1020  0.1038  0.1056  0.1075  0.1093  0.1112  0.1131  0.1151  -1.2
0.1170  0.1190  0.1210  0.1230  0.1251  0.1271  0.1292  0.1314  0.1335  0.1357  -1.1
0.1379  0.1401  0.1423  0.1446  0.1469  0.1492  0.1515  0.1539  0.1562  0.1587  -1.0

0.1611  0.1635  0.1660  0.1685  0.1711  0.1736  0.1762  0.1788  0.1814  0.1841  -0.9
0.1867  0.1894  0.1922  0.1949  0.1977  0.2005  0.2033  0.2061  0.2090  0.2119  -0.8
0.2148  0.2177  0.2206  0.2236  0.2266  0.2296  0.2327  0.2358  0.2389  0.2420  -0.7
0.2451  0.2483  0.2514  0.2546  0.2578  0.2611  0.2643  0.2676  0.2709  0.2743  -0.6
0.2776  0.2810  0.2843  0.2877  0.2912  0.2946  0.2981  0.3015  0.3050  0.3085  -0.5

0.3121  0.3156  0.3192  0.3228  0.3264  0.3300  0.3336  0.3372  0.3409  0.3446  -0.4
0.3483  0.3520  0.3557  0.3594  0.3632  0.3669  0.3707  0.3745  0.3783  0.3821  -0.3
0.3859  0.3897  0.3936  0.3974  0.4013  0.4052  0.4090  0.4129  0.4168  0.4207  -0.2
0.4247  0.4286  0.4325  0.4364  0.4404  0.4443  0.4483  0.4522  0.4562  0.4602  -0.1
0.4641  0.4681  0.4721  0.4761  0.4801  0.4840  0.4880  0.4920  0.4960  0.5000  -0.0

*For Z < -3.50, the probability is less than or equal to 0.0002.
positive Z

                      Second decimal place of Z
  Z    0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0  0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1  0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2  0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3  0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4  0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879

0.5  0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6  0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7  0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8  0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9  0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389

1.0  0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1  0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2  0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3  0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4  0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319

1.5  0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
1.6  0.9452  0.9463  0.9474  0.9484  0.9495  0.9505  0.9515  0.9525  0.9535  0.9545
1.7  0.9554  0.9564  0.9573  0.9582  0.9591  0.9599  0.9608  0.9616  0.9625  0.9633
1.8  0.9641  0.9649  0.9656  0.9664  0.9671  0.9678  0.9686  0.9693  0.9699  0.9706
1.9  0.9713  0.9719  0.9726  0.9732  0.9738  0.9744  0.9750  0.9756  0.9761  0.9767

2.0  0.9772  0.9778  0.9783  0.9788  0.9793  0.9798  0.9803  0.9808  0.9812  0.9817
2.1  0.9821  0.9826  0.9830  0.9834  0.9838  0.9842  0.9846  0.9850  0.9854  0.9857
2.2  0.9861  0.9864  0.9868  0.9871  0.9875  0.9878  0.9881  0.9884  0.9887  0.9890
2.3  0.9893  0.9896  0.9898  0.9901  0.9904  0.9906  0.9909  0.9911  0.9913  0.9916
2.4  0.9918  0.9920  0.9922  0.9925  0.9927  0.9929  0.9931  0.9932  0.9934  0.9936

2.5  0.9938  0.9940  0.9941  0.9943  0.9945  0.9946  0.9948  0.9949  0.9951  0.9952
2.6  0.9953  0.9955  0.9956  0.9957  0.9959  0.9960  0.9961  0.9962  0.9963  0.9964
2.7  0.9965  0.9966  0.9967  0.9968  0.9969  0.9970  0.9971  0.9972  0.9973  0.9974
2.8  0.9974  0.9975  0.9976  0.9977  0.9977  0.9978  0.9979  0.9979  0.9980  0.9981
2.9  0.9981  0.9982  0.9982  0.9983  0.9984  0.9984  0.9985  0.9985  0.9986  0.9986

3.0  0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990
3.1  0.9990  0.9991  0.9991  0.9991  0.9992  0.9992  0.9992  0.9992  0.9993  0.9993
3.2  0.9993  0.9993  0.9994  0.9994  0.9994  0.9994  0.9994  0.9995  0.9995  0.9995
3.3  0.9995  0.9995  0.9995  0.9996  0.9996  0.9996  0.9996  0.9996  0.9996  0.9997
3.4  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9998

*For Z > 3.50, the probability is greater than or equal to 0.9998.
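
Entries of the normal probability table can also be regenerated in software. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `table_row` is an assumption, and it covers only non-negative Z (for a negative Z, the left area is 1 minus the left area of the corresponding positive Z, by symmetry).

```python
from statistics import NormalDist

def table_row(first_decimal: float) -> list:
    """Percentiles for Z = first_decimal + 0.00, 0.01, ..., 0.09 (non-negative Z)."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        round(std_normal.cdf(round(first_decimal + j / 100, 2)), 4)
        for j in range(10)
    ]

row = table_row(1.9)
print(row[6])  # entry for Z = 1.96
```

Each list position corresponds to one "second decimal place" column, so `table_row(1.9)[6]` is the percentile for Z = 1.96.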
C.2 t Distribution Table

[Figure: three t distribution curves, labeled One tail, One tail, and Two tails.]

Figure C.1: Three t distributions.
Chi-Square Probability Table

Figure C.2: Areas in the chi-square table always refer to the right tail.
Addition Rule, 57
adjusted R^2, 313
alternative hypothesis (H_A), 156
analysis of variance (ANOVA), 294, 321
anecdotal evidence, 27
associated, 8

backward-elimination, 315
bar plot, 20
Bayesian statistics, 81
Bernoulli random variable, 118
bias, 2o
bimodal, 12
binomial distribution, 122
blind, 36
blocking, 34
blocks, 34
Bonferroni correction, 331

case, 3
categorical, 5
chi-square distribution, 214
chi-square probability table, 215
cluster sample, 32
clusters, 32
cohort, 30
collections, 58
collinear, 312
column proportion, 20
column totals, 20
complement, 62
condition, 72
conditional probability, 71
confidence interval, 149
confident, 149
confounder, 31
confounding factor, 31
confounding variable, 31
contingency table, 20
continuous, 5
control, 34
control group, 2, 36
convenience sample, 29
correlation, 280

data, 1
data density, 11
data fishing, 324
data matrix, 4
data snooping, 324
deck of cards, 59
degrees of freedom, 313
degrees of freedom (df), 242
degrees of freedom (df), 214
density, 68
dependent, 8, 29
deviation, 13
df, see degrees of freedom (df)
discrete, 5
disjoint, 57
distribution, 10, 68
double-blind, 36

error, 146
events, 58
expected value, 84
experiment, 30
experiments, 34
explanatory, 29
exponentially, 119
extrapolation, 287

F test, 326
face card, 59
factorial, 122
failure, 118
first quartile, 15
forward-selection, 316
frequency table, 20
full model, 314

General Addition Rule, 60
generalized linear models, 132

high leverage, 290 
histogram, 11 
hollow histograms, 25 
hypotheses, 156 

independent, 8, 29, 64 

independent and identically distributed (iid), 119

influential point, 290 
interquartile range, 15 
interquartile range (IQR), 15 

joint probability, 70 

Law of Large Numbers, 56 
least squares criterion, 283 
least squares line, 283 
left skewed, 12 
levels, 5 

linear combination, 88 
long tail, 12 
lurking variable, 31 

margin of error, 151, 173 
marginal probabilities, 70 
mean, 10 

mean square between groups (MSG), 325 

mean square error (MSE), 326 

median, 15, 16 

midterm election, 290 

mode, 12 

mosaic plot, 22 

multimodal, 12 

multiple comparisons, 331 

multiple regression, 311 

mutually exclusive, 57 

n choose k, 122 

negative binomial distribution, 128 
negatively associated, 8 
nominal, 5 
non-response, 29 
non-response bias, 29 
normal curve, 104 
normal distribution, 104 
normal probability plot, 114 
normal probability table, 107 
null hypothesis (H_0), 156
null value, 157 



numerical, 4 

observational study, 30 
observational unit, 3 
one-sided, 162 
one-way ANOVA, 332 
ordinal, 5 
outcome, 2, 56 
outlier, 17 
outliers, 17 

p-value, 161

paired, 192 

parameters, 105 

patients, 36 

percentile, 15, 107 

permutation test, 261 

pie chart, 24 

placebo, 2, 36 

placebo effect, 2, 36 

point estimate, 144 

point-slope, 285 

Poisson distribution, 131 

pooled estimate, 210 

pooled standard deviation, 254 

population, 9, 26 

population mean, 144 

population parameters, 145 

positive association, 8 

power, 178 

practically significant, 179 
predictor, 274 
primary, 76 
probability, 56 

probability density function, 68 

probability distribution, 61 

probability of a success, 118 

probability sample, see sample 

Product Rule for independent processes, 65 

prosecutor's fallacy, 325 

prospective study, 31 

quantile-quantile plot, 114

R-squared, 288 
random process, 56 
random variable, 83 
randomization technique, 38 
randomized experiment, 30, 34 
rate, 132 
relative frequency table, 20
replicate, 34 
representative, 29 
residual plot, 279 
residuals, 277 
response, 29 
retrospective studies, 31 
right skewed, 12 
robust estimates, 18 
row proportions, 20 
row totals, 20 
running mean, 145 

sample, 9, 26 
sample mean, 144 
sample proportion, 118 
sample space, 62 
sampling distribution, 146 
sampling variation, 144 
scatterplot, 6, 9 
secondary, 76 
segmented bar plot, 22 
sets, 58 

side-by-side box plot, 25 
significance level, 160 
simple random sample, 28 
simulation, 38 
skewed to the high end, 12 
skewed to the positive end, 12 
skewed to the right, 12 
standard deviation, 13, 86 
standard error (SE), 146 
statistically significant, 179 
stepwise, 315 
strata, 32 

stratified sampling, 32 
study participants, 36 
success, 118 

success-failure condition, 202 